1 Executive Summary

The study is concerned with three objectives: 

1. Proposal of a hardware framework which enables an efficient integration of the PACT XPP
core into a standard RISC processor architecture.

2. Proposal of a compiler for the coupled RISC+XPP hardware. This compiler decides
which parts of a source code are executed on the RISC and which parts are executed on the XPP.

3. Presentation of a number of case studies demonstrating which results may be achieved by
using the proposed C compiler in cooperation with the proposed hardware framework.

The proposed hardware framework accelerates the XPP core in two respects. First, data throughput is
increased by raising the XPP's internal operating frequency into the range of the RISC's frequency.

s?^js^ p sector takes piace ww,e ^ is ° peratin ^ « ^ ^ 

A further problem with a coupled RISC+XPP hardware is concerned with the RISC's
multitasking concept. It becomes necessary to interrupt computations on the XPP in order to perform a
task switch. Multitasking is supported by the proposed compiler, as well as by the proposed hardware.
First, each XPP configuration is considered as an uninterruptible entity. This means that the compiler,
which generates the configurations, takes care that the execution time of any configuration does not
exceed a predefined time slice. Second, the cache controller is concerned with the saving and restoring
of the XPP's state after an interrupt. The proposed cache concept minimizes the memory traffic for
interrupt handling and frequently even allows avoiding memory accesses at all.

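Finally, the proposed cache concept is based on a simple IRAM cell structure allowing for an easy
scalability of the hardware: extending the XPP cache size, for instance, requires not much more than
the duplication of IRAM cells.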

The study proposes a compiler for a RISC+XPP system. The objective of the compiler is that real-world
applications, which are written in the C language, can be compiled for a RISC+XPP system.
The compiler removes the necessity of developing NML code for the XPP by hand.

The proposed compiler has three major components to perform the compilation for the XPP:

1 . partitioning of the C source code into RISC and XPP parts, 

2. transformations to optimize the code for the XPP and 
3. generating NML code.
Finally the generated NML code is placed and routed for the XPP. 

The partitioning component of the compiler decides which parts of an application can be
executed on the XPP and which parts are executed on the RISC. Typical candidates for XPP
code are loops with a large number of iterations. The remaining source code, including the
data transfer code, is compiled for the RISC.
2 Hardware



2.1 Design Parameter Changes 

Since the XPP core shall be integrated as a functional unit into a standard RISC core, some system
parameters have to be reconsidered:

2.1.1 Pipelining, Concurrency and Synchronicity 

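RISC instructions of totally different type (Ld/St, ALU, Mul/Div/MAC, FPALU, FPMul, ...) are
executed in separate specialized functional units to increase the fraction of silicon that is busy
on average.

Each functional unit of a RISC core is highly pipelined to improve throughput. Pipelining overlaps
the execution of several instructions by splitting them into unrelated phases, which are executed in
different stages of the pipeline. Thus different stages of consecutive instructions can be executed in
parallel, with each stage taking much less time to execute.

Since the XPP core uses dataflow computation, it is pipelined by design. However, a single
configuration usually implements a loop of the application, so the configuration remains active for
many cycles, unlike the instructions in every other functional unit, which typically execute for one or
two cycles at most. Therefore it is still worthwhile to consider the separation of several phases of an
XPP configuration into several functional units to improve concurrency via pipelining on this coarser
scale.

The multi-cycle execution time also forbids a strongly synchronous execution scheme and rather leads
to an asynchronous scheme. This in turn necessitates the
existence of explicit synchronization instructions.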

2.1.2 Core Frequency and the Memory Hierarchy 

As a functional unit, the XPP's operating frequency will either be half of the core frequency or equal
to the core frequency of the RISC. Almost every RISC core exceeds its
memory bus frequency with its core frequency by a large factor. Therefore caches are employed,
forming what is commonly called the memory hierarchy. Each layer of cache is larger but slower than
its predecessors.
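This memory hierarchy does not help to speed up computations which shuffle large amounts of data
with little or no data reuse. These computations are called "bounded by memory bandwidth". However,
other types of computations with more data locality (another name for data reuse) gain performance as
long as they fit into one of the upper layers of the memory hierarchy. This is the class of applications
that gain the highest speed-ups when a memory hierarchy is introduced.

Classical vectorization can be used to transform memory-bounded algorithms with a data set too big
to fit into the upper layers of the memory hierarchy. Rewriting the code to reuse smaller data sets
sooner exposes memory reuse on a smaller scale. As the new data set size is chosen to fit into the
caches of the memory hierarchy, the algorithm is not memory bounded any more.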



2.1.3 Software: Multitasking Operating Systems 

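As the XPP is introduced into a RISC core, the changed environment (higher frequency and the
memory hierarchy) not only necessitates a reconsideration of hardware design parameters, but also a
reevaluation of the software environment.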

Memory Hierarchy 

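The introduction of a memory hierarchy enhances the set of applications that can be implemented
efficiently. So far the XPP has mostly been used for algorithms that read their data sets in a linear
manner, applying some calculations in a pipelined fashion and writing the data back to memory. As
long as all of the computation fits into the XPP array, these algorithms are memory bounded. Typical
applications are filtering and audio signal processing in general.

But there is another set of algorithms that have even higher computational complexity and higher
memory bandwidth requirements. Examples are picture and video processing, where a second and
third dimension of data coherence opens up. This coherence is e.g. exploited by picture and video
compression algorithms that scan pictures in both dimensions to find similarities, even searching
consecutive pictures of a video stream for analogies. Naturally these algorithms have a much higher
algorithmic complexity as well as higher memory requirements. Yet they are data local, either by
design or they can be transformed to be, thus efficiently exploiting the memory hierarchy and the
higher clock frequencies of processors with caches.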



Multi Tasking 



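The introduction into a standard RISC core makes it necessary to understand and support the needs of
a multitasking operating system, as standard RISC processors are usually operated in multitasking
environments. With multitasking, the operating system switches the executed application on a regular
basis, thus simulating concurrent execution of several applications (tasks). To switch tasks, the
operating system has to save the state (i.e. the contents of all registers) of the running task and then
reload the state of another task. Hence it is necessary to determine what the state of the processor is,
and to keep it as small as possible to allow efficient context switches.

Modern microprocessors gain their performance from multiple specialized and deeply pipelined
functional units and high memory hierarchies, enabling high core frequencies. But high memory
hierarchies mean that there is a high penalty for cache misses due to the difference between core and
memory frequency. Many core cycles pass until the values are finally available from memory. Deep
pipelines incur pipeline stalls due to data dependences as well as branch penalties for mispredicted
conditional branches. Specialized functional units like floating point units idle for integer-only
programs. For these reasons, average functional unit utilization is much too low.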
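Simultaneous multithreading (SMT) adds hardware support for a finer granularity (instruction /
functional unit level) switching of tasks, exposing more
than one independent instruction stream to be executed. Thus, whenever one instruction stream stalls
or doesn't utilize all functional units, the other one can jump in. This improves functional unit
utilization for today's processors.

With SMT, the task (process) switching is done in hardware, so the processor state has to be
duplicated in hardware. So again it is most efficient to keep the state as small as possible. For the
combination of the PACT XPP and a standard RISC processor, SMT is very beneficial, since the XPP
configurations execute longer than the average RISC instruction. Thus another task can utilize the
other functional units while a configuration is running. On the other side, not every task will utilize
the XPP, so while one such non-XPP task is running, another one will be able to use the XPP core.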

2.2 Communication Between the RISC Core and the XPP Core

The following sections introduce several possible hardware implementations for accessing memory. 

2.2.1 Streaming 

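Streaming is only well suited for small XPP arrays with heavily pipelined configurations that feature
few inputs and outputs. As the pipelines take a long time to fill and empty while the running time of a
configuration is limited (as described under "context switches"), this type of communication does not
scale well to bigger XPP arrays and XPP frequencies near the RISC core frequency.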

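■ Streaming from the RISC core

In this setup, the RISC supplies the XPP array with the streaming data. Since the RISC core
has to execute several instructions to compute addresses and load an item from memory, this
setup is only suited if the XPP core is reading data with a frequency much lower than the
RISC core frequency.

■ Streaming via DMA

In this mode the RISC core only initializes a DMA channel, which then supplies the data items
to the streaming port of the XPP core.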

2.2.2 Shared Memory (Main Memory) 

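In this configuration the XPP array configuration uses a number of PAEs to generate an address that is
used to access main memory through the IO ports. As the number of IO ports is very limited, this
approach suffers from the same limitations as the previous one, although for larger XPP arrays the
impact of using PAEs for address generation is diminishing. However, this approach is still useful for
loading values from very sparse vectors.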

2.2.3 Shared Memory (IRAM)

This data access mechanism uses the IRAM elements to store data for local computations. The IRAMs
can either be viewed as vector registers or as local copies of main memory.
There are several ways to fill the IRAMs with data.

1. The IRAMs are loaded in advance by a separate configuration using streaming.

This method can be implemented with the current XPP architecture. The IRAMs act as vector
registers. As explained above, this will limit the performance of the XPP array, especially as
the IRAMs will always be part of the externally visible state and hence must be saved and
restored on context switches.

2. The IRAMs can be loaded in advance by separate load-instructions.

This is similar to the first method. Load-instructions which load the data into the IRAMs are
implemented in hardware. The load-instructions can be viewed as a hard coded load-
configuration. Therefore configuration reloads are reduced. Additionally, the special load-
instructions may use a wider interface to the memory hierarchy.
Therefore a more efficient method than streaming can be used.

3. The IRAMs can be loaded by a "burst preload from memory" instruction of the cache
controller. No configuration or load-instruction is needed on the XPP. The IRAM load is
implemented in the cache controller and triggered by the RISC processor. But the IRAMs still
act as vector registers and are therefore included in the externally visible state.

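4. The best mode however is a combination of the previous solutions with the extension of a
cache: a preload instruction maps a specific memory area, defined by starting address and size,
to an IRAM. This triggers a burst load from the memory hierarchy (cache). After all IRAMs are
mapped, the next configuration can be activated. The activation incurs a wait until all burst
loads are completed. However, if the preload instructions are issued long
enough in advance and no interrupt or task switch destroys cache locality, the wait will not
consume any time.

To specify a memory block as output-only IRAM, a "preload clean" instruction is used, which
avoids loading data from memory.

A synchronization instruction is needed to make sure that the content of a specific memory area,
which is cached in an IRAM, is written back to the memory hierarchy. This can be done
globally (full write-back), or selectively by specifying the memory area, which will be
accessed.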



2.3 State of the XPP Core 

The state of the XPP core can be classified as

1 Read only (instruction data)

• configuration data, consisting of PAE configuration and routing configuration data

2 Read - Write

• the contents of the data registers and latches of the PAEs, which are driven onto the busses

• the contents of the IRAM elements
2.3.1 Limiting Memory Traffic

There are several possibilities to limit the amount of memory traffic during context switches. 

Do not save read-only data 

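This avoids storing configuration data, since configuration data is read only. The current configuration
is simply overwritten by the new one.

Save less data

If a configuration is defined to be uninterruptible (non pre-emptive), all of the local state on the busses
and in the PAEs can be discarded, since it is guaranteed that the configuration will be finished before
a context switch. Every configuration gets its input data
from the IRAMs and writes its output data to the IRAMs. So after the configuration has finished, all
information in the PAEs and on the buses is redundant or invalid and does not have to be saved.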

Save modified data only 

Use caching to reduce the memory traffic 

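Preloading of configurations helps to parallelize the memory transfers with other computations during
the task switch. This cache can also reduce the memory traffic for frequent context switches, provided
that a Least Recently Used (LRU) replacement strategy is implemented in addition to the preload
mechanism.

The IRAMs can be defined to be local cache copies of main memory, as proposed as fourth method in
section 2.2.3. Then each IRAM is associated with a starting address and modification state information.
The IRAM memory cells are replicated. An IRAM PAE contains an IRAM block with
multiple IRAM instances. Only the starting addresses of the IRAMs have to be saved and restored as
context. The starting addresses for the IRAMs of the current configuration select the IRAM instances
with identical addresses to be used.

If no address tag of an IRAM instance matches the address of the newly loaded context, the
corresponding memory area is loaded to an empty IRAM instance.

If no empty IRAM instance is available, a clean (unmodified) instance is
declared empty (and hence must be reloaded later on).

If no clean IRAM instance is available, a modified (dirty) instance is cleaned by writing its data back
to main memory. This adds a certain delay for the write-back.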
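The selection order for IRAM instances described above can be sketched in C as follows. This is only an illustrative software model, not the proposed hardware: the type and function names (`iram_instance`, `select_instance`, the `last_use` counter) are hypothetical, and choosing the least recently used instance among several clean or several dirty candidates is an assumption based on the LRU strategy mentioned in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one IRAM block holding several IRAM instances. */
enum iram_state { IRAM_EMPTY, IRAM_CLEAN, IRAM_DIRTY };

struct iram_instance {
    uintptr_t tag;          /* starting address of the cached memory area */
    enum iram_state state;
    unsigned last_use;      /* larger value = used more recently (assumed) */
};

enum iram_action {
    USE_MATCH,          /* tag matches: instance can be used as is        */
    LOAD_INTO_EMPTY,    /* load the memory area into an empty instance    */
    DISCARD_CLEAN,      /* declare a clean instance empty, then load      */
    WRITE_BACK_DIRTY    /* write a dirty instance back first, then load   */
};

/* Returns the index of the instance that should serve `addr` and stores
   the required action in *action. Order: tag match, empty, clean, dirty. */
static size_t select_instance(const struct iram_instance *inst, size_t n,
                              uintptr_t addr, enum iram_action *action)
{
    size_t empty = n, clean = n, dirty = n;   /* n means "none found" */

    assert(n > 0 && action != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (inst[i].state != IRAM_EMPTY && inst[i].tag == addr) {
            *action = USE_MATCH;
            return i;
        }
        if (inst[i].state == IRAM_EMPTY) {
            if (empty == n) empty = i;
        } else if (inst[i].state == IRAM_CLEAN) {
            if (clean == n || inst[i].last_use < inst[clean].last_use) clean = i;
        } else {
            if (dirty == n || inst[i].last_use < inst[dirty].last_use) dirty = i;
        }
    }
    if (empty != n) { *action = LOAD_INTO_EMPTY; return empty; }
    if (clean != n) { *action = DISCARD_CLEAN;   return clean; }
    *action = WRITE_BACK_DIRTY;
    return dirty;
}
```

Only the last case incurs the write-back delay; the first three can be served without writing to main memory.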

2.4 Context Switches 

Multitasking is achieved by switching contexts, where all, or at least the most relevant parts of the
processor state which belong to the current task (the task's context) are exchanged with the state of
another task, that will be executed next.

There are three types of context switches: switching of virtual processors with simultaneous
multithreading (SMT, also known as HyperThreading), execution of an Interrupt Service Routine
(ISR), and a Task Switch.



2.4.1 SMT Virtual Processor Switch 

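This type of context switch is executed without software interaction, totally in hardware. Instructions
of several instruction streams are merged into a single instruction stream to increase instruction level
parallelism and improve functional unit utilization. Hence the processor state cannot be stored to and
reloaded from memory between instructions from different instruction streams: imagine the worst case
of alternating instructions from two streams and the hundreds to thousands of cycles needed to write the
processor state to memory and read in another state.

Hence hardware designers have to replicate the internal state for every virtual processor. Every
instruction is executed within the context (on the state) of the virtual processor whose program
counter was used to fetch the instruction. By replicating the state, only the multiplexers, which have to
be inserted to select one of the different states, have to be switched.

Thus the size of the state also increases the silicon area needed to implement SMT, so the size of the
state is crucial for many design decisions.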

2.4.2 Interrupt Service Routine 

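The part of the state, which is destroyed by the jump to the ISR, is saved by hardware (e.g. the
program counter). It is the ISR's responsibility to save and restore the state of all other resources
that are actually used within the ISR.

The execution model of the instructions also affects the tradeoff between short interrupt latencies
and maximum throughput: throughput is maximized if the instructions in the pipeline are finished,
and the instructions of the ISR are chained. This adversely affects the interrupt latency. If, however,
the instructions are abandoned (pre-empted) in favor of a short interrupt latency, they must be fetched
again later, which affects throughput. The third possibility would be to save the internal state of the
instructions within the pipeline, but this requires too much hardware effort.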

2.4.3 Task Switch 

This type of context switch is executed totally in software. All of a task's context (state) has to be
saved to memory, and the context of the new task has to be reloaded. Since tasks are usually allowed
to use all of the processor's resources to achieve top performance, all of the processor state has to be
saved and restored. If the amount of state is excessive, the rate of context switches must be decreased
by less frequent rescheduling, or a severe throughput degradation will result.






WO 2005/010632 

PCT/EP2004/006547 

2.5 A Load Store Architecture 

We propose an XPP integration as an asynchronously pipelined functional unit for the RISC. We
further propose an explicitly preloaded cache for the IRAMs on top of the memory hierarchy existing
within the RISC (as proposed as fourth method in section 2.2.3). Additionally, an
explicitly preloaded configuration cache within the PAE array is employed to support preloading of
configurations and fast switching between configurations.

fSi?^^^ SSK by SrteT^.r- * ^ ?~ «*- — 

^wWoy^xSa^ 

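The pipeline stages of the XPP functional unit are Load, Execute and Write-Back (Store).

The pipeline stages execute in an asynchronous fashion, thus
hiding the variable delays from the cache preloads and the PAE array.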

Instructions 

The instructions below execute asynchronously. The XppSync instruction can be used to force
synchronization.

XppPreloadConfig (void *ConfigurationStartAddress)

The configuration is added to the preload FIFO to be loaded into the configuration cache within the
PAE array.

Note that speculative preloads are possible, since successive preload commands overwrite the
previous.

The parameter is a pointer register of the RISC pointer register file. The size is implicitly contained in
the configuration.

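XppPreload (int IRAM, void *StartAddress, int Size)
XppPreloadClean (int IRAM, void *StartAddress, int Size)

The first parameter is the IRAM number. This is an immediate (constant) value.

The second parameter is a pointer to the starting address. This parameter is provided in a pointer
register of the RISC pointer register file.

The third parameter is the size in units of 32 bit words. This is an integer value. It resides in a
general-purpose register of the RISC's integer register file.

The first variant actually preloads the data from memory.

The second variant is for write-only accesses. It skips the loading operation. Only the address and
size are defined. They are obviously needed for the write-back operation of the IRAM cache.

Note that speculative preloads are possible, since successive preload commands to the same IRAM
overwrite each other (if no configuration is executed in between). Thus only the last preload
command is actually effective when the configuration is executed.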

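XppExecute ()

This instruction executes the last preloaded configuration with the last preloaded IRAM contents.
Whenever a configuration finishes, the next one is consumed from the head of the
FIFO, if its start command has already been issued.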

XppSync (void *StartAddress, int Size) 

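This instruction forces write-back operations for all IRAMs that overlap the given memory area.

The first parameter is a pointer to the starting address. This parameter is provided in a pointer register
of the RISC pointer register file.

The second parameter is the size. This is an integer value. It resides in a general-purpose register of the
RISC's integer register file.

This instruction can also be used for pure synchronization:
giving a size of zero can be used as a simple wait for the end of the configuration.
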
XppSave (void *StartAddress) 

This instruction saves the task context of the XPP to the given memory area.

The parameter is a pointer to the starting address. This parameter is provided in a pointer register of
the RISC pointer register file.

XppRestore (void *StartAddress)

This instruction restores the task context of the XPP from the given memory area.

The parameter is a pointer to the starting address. This parameter is provided in a pointer register of
the RISC pointer register file.

The size depends on the actual implementation of the XPP. Typically, the scheduler of the
operating system will use these instructions.

2.5.1 A Basic Implementation 
[Diagram labels: instructions; RISC Ld/St unit - cache ctrl; RISC; XPP Ld/St unit - cache ctrl; XPP]

Figure 1: Memory interface
The XPP core shares the memory hierarchy with the RISC core using a special cache controller. 



XppPreloadConfig( XppCfg_foo );

for (int i = 0; i < 1000; ++i) {
    XppPreload( 2, &a[i*30], 30 );
    XppPreload( 0, &b[i*200], 200 );
    XppPreloadClean( 5, &c[i*10], 10 );
    XppExecute( );
    /*
       Other RISC computations ...
       In the meanwhile the burst preloads and
       the previous configuration are running.
       The new configuration is executed as soon
       as the preloads and the previous
       configuration are finished.
       New burst preloads can be issued
       according to the FIFO length.
    */
}

Note: in all places where constants are used,
the value should actually come from a register.



Legend: per-thread state resource; write-back if dirty; volatile read-only resource

Figure 2: IRAM & configuration cache controller data structures and usage example



The preload FIFOs in the above figure contain the addresses and sizes for already issued IRAM
preloads. This way the following preloads replace the previously used IRAMs and configurations.
If no preload is issued for an IRAM before the configuration is executed, the preload of the previous
configuration is retained. Therefore it is not necessary to repeat identical preloads for an IRAM in
consecutive configurations.
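The preload semantics described so far (successive preloads to the same IRAM overwrite each other, and an IRAM without a new preload retains the preload of the previous configuration) can be illustrated with a small software model. The sketch assumes a FIFO length of one and 16 IRAMs; `xpp_model` and the `model_*` functions are hypothetical stand-ins and do not represent the actual instructions or the cache controller.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_IRAMS 16   /* assumption for this sketch */

struct preload {
    const void *addr;  /* starting address of the mapped memory area */
    int size;          /* size of the area                           */
    int load;          /* 1 = load data, 0 = "clean" (write-only)    */
};

struct xpp_model {
    struct preload pending[NUM_IRAMS];  /* entry being filled by preloads */
    struct preload active[NUM_IRAMS];   /* entry used by the last execute */
};

/* Models XppPreload: a later preload to the same IRAM replaces the earlier. */
static void model_preload(struct xpp_model *m, int iram,
                          const void *addr, int size)
{
    assert(iram >= 0 && iram < NUM_IRAMS);
    m->pending[iram] = (struct preload){ addr, size, 1 };
}

/* Models XppPreloadClean: only address and size are recorded. */
static void model_preload_clean(struct xpp_model *m, int iram,
                                const void *addr, int size)
{
    assert(iram >= 0 && iram < NUM_IRAMS);
    m->pending[iram] = (struct preload){ addr, size, 0 };
}

/* Models XppExecute: the pending entry becomes active. The pending entry
   is left unchanged, so IRAMs without a new preload keep the old mapping. */
static void model_execute(struct xpp_model *m)
{
    for (int i = 0; i < NUM_IRAMS; ++i)
        m->active[i] = m->pending[i];
}
```

With this model, two preloads to IRAM 2 followed by an execute leave only the second mapping active, and a further execute without a new preload for IRAM 2 keeps that mapping.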



Figure 3: Asynchronous pipeline of the XPP



The configuration has to wait (is stalled) until all necessary preloads are finished.

Write-back is triggered explicitly by the use of a synchronization command or implicitly by the cache
controller. Hence the cache controller (XPP Ld/St unit) has to handle the synchronization and execute
commands as well, actually starting the configuration as soon as all data is ready.

The XPP PAE array and the XPP cache controller can be seen
as a single unit since they do not have different instruction streams: rather, the cache controller can be
seen as the configuration fetch (CF), operand fetch (OF) (IRAM preload) and write-back (WB) stage
of the XPP pipeline, also triggering the execute stage (EX) (PAE array).

Due to the long latencies and their non-predictability (cache misses, variable length configurations),
the stages can be overlapped several configurations wide, using the configuration and data preload
FIFO (= pipeline) for loose coupling. So if a configuration is executing and the data for the next has
already been preloaded, the data for the next but one configuration is preloaded.
[State diagram labels: wait; execute; preload; write back; discard LRU clean IRAM; configuration finished; preload needed urgently; all write backs blocked by in-use IRAMs; all preloads blocked by dirty or in-use IRAMs; no clean IRAM instance; no empty IRAM instance]

Figure 4: State transition diagram for the XPP cache controller
The XPP cache controller has several tasks. These are depicted as states in the above diagram. State
transitions take place along the edges between states, whenever the condition for the edge is true. As
soon as the condition is not true any more, the reverse state transition takes place. The activities for the
states are as follows:

%5£T£Z&2: 2ZS%32- ^ to &,fl " i — *— ™. <«>= 

sees ratssffiir" - — 

If no empty or clean IRAM instance exists,
a dirty one has to be written back to the memory hierarchy. It cannot occur that no empty, clean or
dirty IRAM instances exist, since only one instance can be in-use and there should be more than one
instance in an IRAM block - otherwise no caching effect is achieved.

[Diagram labels: fetch / reorder / issue; virtual processors]

Figure 5: Adding simultaneous multithreading
In an SMT environment the load FIFOs have to be replicated for every virtual processor. The pipelines
of the functional units are fed from the shared fetch / reorder / issue stage. All functional units execute
in parallel. Different units can execute instructions of different virtual processors.
So we get the following design parameters with their smallest initial value: 
IRAM length: 

FIFO length: 

This parameter helps to hide cache misses during preloading: the longer the FIFO length, the
less disruptive is a series of cache misses for a single configuration.

IRAM duplication factor: (pipeline stages + caching factor) * virtual processors
Pipeline stages is the number of pipeline stages LD/EX/WB plus one for every FIFO stage
above one.
Caching factor is the number of IRAM duplicates available for caching.
Virtual processors is the number of virtual processors with SMT.

The size of the state of a virtual processor is mainly dependent on the FIFO length. It is: 
FIFO length * #IRAM ports * (32 bit (Address) + 32 bit (Size)) 

This has to be replicated for every virtual processor. 

The total size of memory used for the IRAMs is: 

#IRAM ports * IRAM duplication factor* IRAM length * 32 bit 
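The two size formulas above can be evaluated with a few lines of C. This is a minimal sketch: the struct and function names are hypothetical, and any parameter values passed in are purely illustrative, since the text above does not fix them here.

```c
#include <assert.h>
#include <stdint.h>

/* Design parameters named as in the text; all values are supplied by the caller. */
struct xpp_params {
    unsigned iram_length;         /* words per IRAM                       */
    unsigned fifo_length;         /* entries per preload FIFO             */
    unsigned iram_ports;          /* number of IRAM ports                 */
    unsigned pipeline_stages;     /* LD/EX/WB plus extra FIFO stages      */
    unsigned caching_factor;      /* IRAM duplicates available as cache   */
    unsigned virtual_processors;  /* number of SMT virtual processors     */
};

/* (pipeline stages + caching factor) * virtual processors */
static unsigned iram_duplication_factor(const struct xpp_params *p)
{
    assert(p != 0);
    return (p->pipeline_stages + p->caching_factor) * p->virtual_processors;
}

/* FIFO length * #IRAM ports * (32 bit address + 32 bit size), in bits,
   for one virtual processor. */
static uint64_t vp_state_bits(const struct xpp_params *p)
{
    assert(p != 0);
    return (uint64_t)p->fifo_length * p->iram_ports * (32u + 32u);
}

/* #IRAM ports * IRAM duplication factor * IRAM length * 32 bit, in bits. */
static uint64_t iram_memory_bits(const struct xpp_params *p)
{
    return (uint64_t)p->iram_ports * iram_duplication_factor(p)
           * p->iram_length * 32u;
}
```

As an illustration only: with 16 IRAM ports, an IRAM length of 128 words, a FIFO length of 1, 3 pipeline stages, no caching duplicates and one virtual processor, the duplication factor is 3, the per-processor FIFO state is 1024 bits and the IRAM memory is 196608 bits (24 KiB).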

2.5.2 Implementation Improvements 

Write Pointer 

Stag. ^ C0Mn " ler 0h °° SeS *° blocki "8 ^ »" n* d P ,S£ 

Longer FIFOs 

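To increase the concurrency between the RISC core and the
PACT XPP core, the prefetch FIFOs in the above drawing can be extended. Thus the IRAM contents
for several configurations can be preloaded, like the configurations themselves. A configuration
execute switches to the next configuration context.

If an overlap between
any new IRAM area and a currently dirty IRAM content of another IRAM is detected, the new
IRAM is simply not loaded until the write-back of the changed IRAM has finished. Thus the execution
of the new configuration is delayed until the correct data is available.

For a short (single entry) FIFO, an overlap is extremely unlikely, since the compiler will usually leave
the output IRAM contents of the previous configuration in place for the next configuration to skip the
preload. The compiler does so using a coalescing algorithm for the IRAMs / vector registers. The
coalescing algorithm is the same as used for register coalescing in register allocation.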

Read Only IRAMs 

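Whenever the memory that is used by the executing configuration is the source of a preload
command for another IRAM, an XPP pipeline stall occurs: the preload can only be started when the
configuration has finished, and - if the content was modified - the memory content has been written to
the cache. To decrease the number of pipeline stalls, it is beneficial to add an additional read-only
IRAM state. If the IRAM is read only, the content cannot be changed, and the preload of the data to
the other IRAM can proceed without delay. This requires an extension to the preload instructions: the
XppPreload and the XppPreloadClean instruction formats can be combined to a single instruction
format that has two additional bits, stating whether the IRAM will be read and/or written. To support
debugging, violations should be checked at the IRAM ports, raising an exception when needed.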

2.5.3 Support for Data Distribution and Data Reorganization 

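The IRAMs can be read in any order by the PAE array, but the address generation adds complexity,
reducing the number of PAEs available for the actual
computation. So it is best if the IRAMs are accessed in linear order. The memory hierarchy is block
oriented as well, further encouraging linear access patterns in the code to avoid cache misses.

As the IRAM read ports limit the bandwidth between each IRAM and the PAE array to one word read
per cycle, it can be beneficial to distribute the data over several IRAMs to remove this bottleneck. The
top of the memory hierarchy is the source of the data, so the number of cache misses never increases
when the access pattern is changed, as long as the data locality is not destroyed.

Many algorithms access memory in linear order by definition, to utilize block reading and simple
address calculations. In most other cases, and in the cases where loop tiling is needed to increase the
data bandwidth between the IRAMs and the PAE array, the code can be transformed in a way that data
is accessed in optimal order. In many of the remaining cases, the compiler can modify the access
pattern by data layout rearrangements (e.g. array merging), so that finally the data is accessed in the
desired pattern. If none of these optimizations can be used because of dependencies, or because the data
layout is fixed, there are still two possibilities to improve performance: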

Data Duplication 
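o Using several IRAM preload commands specifying just different target IRAMs:

This way cache misses occur only for the first preload. For every additional load, only the time
to transfer the data from the top of the memory hierarchy to the
IRAMs is needed. This is only beneficial if the cache misses plus the
additional transfer times do not exceed the execution time for the configuration.

o Using an IRAM preload instruction to load multiple IRAMs concurrently:

As identical data is needed in several IRAMs, they can be loaded concurrently by writing the same
values to all of them. The problem with this instruction is that it requires a bigger immediate field
for the destination. Accordingly, this instruction format grows at a higher rate
when the number of IRAMs is increased for bigger XPP arrays.

The interface of this instruction looks like:
XppPreloadMultiple (int IRAMS, void *StartAddress, int Size)

This instruction behaves as the XppPreload / XppPreloadClean instructions with the exception
of the first parameter: the first parameter is IRAMS. This is an immediate (constant) value. The
value is a bitmap: for every bit in the bitmap, the IRAM with that number is a target for the load
operation.

There is no "clean" version, since data duplication is applicable for read data only.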
Data Reordering 

o Adding additional functionality to the hardware: 

o Adding a vector stride to the preload instruction. 

A stride (displacement between two elements in memory) is used in vector load
operations to load e.g. a column of a matrix into a vector register.

This is a non-sequential but still linear access pattern. It can be implemented in
hardware by giving a stride to the preload instruction.

The interface of the instruction looks like:

XppPreloadStride (int IRAM, void *StartAddress, int Size, int Stride)
XppPreloadCleanStride (int IRAM, void *StartAddress, int Size, int Stride)

This instruction behaves as the XppPreload / XppPreloadClean instructions with the
addition of another parameter: the fourth parameter is the vector stride.
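The effect of a strided preload can be illustrated by a software stand-in that gathers every `stride`-th word into a linear buffer. This is a sketch only: `preload_stride_model` is a hypothetical helper, not the proposed instruction, and measuring the stride in 32 bit words (rather than bytes) is an assumption, since the text does not fix the unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies `size` 32-bit words into `iram`, starting at `start` and
   advancing `stride` words in memory between consecutive elements. */
static void preload_stride_model(uint32_t *iram, const uint32_t *start,
                                 size_t size, size_t stride)
{
    assert(iram != NULL && start != NULL);
    for (size_t i = 0; i < size; ++i)
        iram[i] = start[i * stride];
}
```

For a row-major matrix with four columns, a start address pointing at element (0, 1) and a stride of 4 yields column 1 as a linear vector, which the PAE array can then read without generating addresses itself.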

o Reordering the data at run time, introducing temporary copies, 
o On the RISC: 

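The RISC can copy data at a maximum rate of one word per cycle for simple address
computations and at a somewhat lower rate for more complex ones.

With a memory hierarchy, the sources will be read from memory (or cache, if they
were used recently) once and written to the temporary copy, which will then reside in the
cache, too. This increases the pressure in the memory hierarchy by the amount of
memory used for the temporaries. Since temporaries are allocated on the stack memory, which
is re-used frequently, the chances are good that the dirty memory area is redefined before it is
written back to memory. Hence the write-back operation to memory is of no concern.

o Via an XPP configuration:

The PAE array can read and write one value from every IRAM per cycle. Thus if half
of the IRAMs are used as inputs and half of the IRAMs are used as outputs, up to
eight (or more, depending on the number of IRAMs) values can be reordered per cycle.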

IRAM Chaining 

If the PAEs do not allow further unrolling, but there are still IRAMs left unused, it is possible to load
additional blocks of data into these IRAMs and chain two IRAMs by means of an address selector.
This does not increase throughput as much as unrolling would do, but it still helps to hide long
pipeline startup delays whenever unrolling is not possible.



2.6 Software / Hardware Interface 

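According to the design parameter changes and the corresponding changes to the hardware, the
hardware / software interface has changed. In the following, the most prominent changes and their
handling will be discussed: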

2.6.1 Explicit Cache 

The proposed cache is not a usual cache, which would be - not considering performance issues -
invisible to the programmer / compiler, as its operation is transparent. The proposed cache is an
explicit cache. Its state has to be maintained by software.
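Cache Consistency and Pipelining of Preload / Configuration / Write-back

The software is responsible for cache consistency. It is possible to have several IRAMs caching the
same, or overlapping memory areas. As long as only one of the IRAMs is written, this is perfectly ok:
only this IRAM will be dirty and will be written back to memory. If however more than one of the
IRAMs is written, it is not defined which data will be written to memory. This is a software bug
(non-deterministic behavior).

As the execution of the configuration is overlapped with the preloads and write-backs of the IRAMs,
it is possible to create preload / configuration sequences that contain data hazards. As the cache
controller and the XPP array can be seen as separate functional units, which are effectively pipelined,
these data hazards are equivalent to pipeline hazards of a normal instruction pipeline. As with any
ordinary pipeline, there are two possibilities to resolve this: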



• Hardware interlocking:

Interlocking is done by the cache controller: if the cache controller detects that the tag of a
dirty or in-use item in IRAMx overlaps a memory area used for another IRAM preload, it has
to stall that preload, effectively serializing the execution of the current configuration and the
preload.

• Software interlocking:

If the cache controller does not enforce interlocking, the code generator has to insert explicit
synchronize instructions to take care of potential interlocks. Inter-procedural and inter-
modular alias- and data-dependence analyses can determine if this is the case, while
scheduling algorithms help to alleviate the impact of the necessary synchronization
instructions.

In either case, as well as in the case of pipeline stalls due to cache misses, SMT can use the
computation power that would be wasted otherwise.
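The hardware interlocking condition can be sketched as an address range overlap test over the IRAM tags. The following C model is an assumption-laden illustration: the tag layout, the function names and the use of byte addresses with sizes in 32 bit words are hypothetical, and only the rule itself (stall if a dirty or in-use tag of another IRAM overlaps the area to be preloaded) is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum tag_state { TAG_EMPTY, TAG_CLEAN, TAG_DIRTY, TAG_IN_USE };

struct iram_tag {
    uintptr_t start;      /* starting byte address of the cached area */
    size_t size_words;    /* size in 32-bit words                     */
    enum tag_state state;
};

/* True if the byte ranges [a, a+a_bytes) and [b, b+b_bytes) intersect. */
static bool areas_overlap(uintptr_t a, size_t a_bytes,
                          uintptr_t b, size_t b_bytes)
{
    if (a_bytes == 0 || b_bytes == 0)
        return false;
    return a < b + b_bytes && b < a + a_bytes;
}

/* True if a preload of `size_words` words at `start` into IRAM `target`
   has to stall because a dirty or in-use tag of another IRAM overlaps it. */
static bool preload_must_stall(const struct iram_tag *tags, size_t n,
                               size_t target, uintptr_t start,
                               size_t size_words)
{
    assert(tags != NULL && target < n);
    for (size_t i = 0; i < n; ++i) {
        if (i == target)
            continue;                       /* only other IRAMs matter */
        if (tags[i].state != TAG_DIRTY && tags[i].state != TAG_IN_USE)
            continue;                       /* empty and clean are harmless */
        if (areas_overlap(tags[i].start, tags[i].size_words * 4,
                          start, size_words * 4))
            return true;
    }
    return false;
}
```

A compiler that performs software interlocking would have to prove the same condition false at compile time, or otherwise insert a synchronization instruction before the preload.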

Code Generation for the Explicit Cache 

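The compiler has to issue the following instructions:

• Configuration preload instructions, which should be scheduled as early as possible by the
instruction scheduler.

• IRAM preload instructions, which should also be scheduled as early as possible by the
instruction scheduler.

• Configuration execute instructions, which should be scheduled between the estimated minimum
and the estimated maximum of the cumulative latency of their preload instructions.

• IRAM synchronization instructions, which should be scheduled as late as possible by the
instruction scheduler. These instructions must be inserted before any potential access of the
RISC to the data areas that are duplicated and potentially modified in the IRAMs. Typically
these instructions will follow a long chain of computations on the XPP, so they will not
significantly decrease performance.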
Asynchronicity to Other Functional Units

An XppSync() must be issued by the compiler, if an instruction of another functional unit (mainly the
Ld/St unit) can access a memory area that is potentially dirty or in-use in an IRAM. This forces a
synchronization of the instruction streams and the cache contents, avoiding data hazards. A thorough
inter-procedural and inter-modular array alias analysis limits the frequency of these synchronization
instructions to an acceptable level.



2.7 Another Implementation 

For the previous design, the IRAMs are existent in silicon, duplicated several times to keep the
pipeline busy. This amounts to a large silicon area that is not fully busy all the time, especially when
the PAE array is not used, but as well whenever the configuration does not use all of the IRAMs
present in the array. The duplication also makes it difficult to extend the lengths of the IRAMs, as the
total size of the already large IRAM area scales linearly.

For a more silicon efficient implementation, we should integrate the IRAMs into the first level cache,
making this cache bigger. This means that we have to extend the first level cache controller to feed all
IRAM ports of the PAE array. This way the XPP and the RISC will share the first level cache in a more
efficient manner. Whenever the XPP is executing, it will steal as much cache space as it needs
from the RISC. Whenever the RISC alone is running, it will have plenty of additional cache space to
improve performance.

The PAE array has the ability to read one word and write one word to each IRAM port every cycle.
This can be limited to either a read or a write access per cycle, without limiting programmability: if
data has to be written to the same area in the same cycle, another IRAM port can be used. This
increases the number of used IRAM ports, but only under rare circumstances.

This leaves sixteen data accesses per PAE cycle in the worst case. Due to the worst case of all sixteen
memory areas for the sixteen IRAM ports mapping to the same associative bank, the minimum
associativity for the cache is 16-way set associativity. This avoids cache replacement for this rare, but
possible worst-case example.

Two factors help to support sixteen accesses per PAE array cycle: 

• The clock frequency of the PAE array generally has to be lower than for the RISC by a factor
of two to four. The reasons lie in the configurable routing channels with switch matrices, which
cannot support as high a frequency as solid point-to-point aluminium or copper traces.

This means that two to four IRAM port accesses can be handled serially by a single cache port,
as long as all reads are serviced before all writes, if there is a potential overlap. This can be
accomplished by assuming a potential overlap and enforcing a priority ordering of all accesses,
giving the read accesses higher priority.

• A factor of two, four or eight is possible by accessing the cache as two, four or eight banks of
lower associativity cache.

For a cycle divisor of four, four banks of four-way associativity will be optimal. During four
successive cycles, each bank of four-way associativity can serve four different accesses. Up to
four-way data duplication can be handled by using adjacent IRAM ports that are connected to
the same bus (bank). For further data duplication, the data has to be duplicated explicitly,
using an XppPreloadMultiple cache controller instruction. The maximum data duplication for
sixteen read accesses to the same memory area is supported by an actual data duplication
factor of four: one copy in each bank. This does not affect the cache RAM efficiency as
adversely as an actual data duplication of 16 for the design proposed in section 2.5.



Figure 6: Cache structure example
The cache controller is running at the same speed as the RISC. The XPP is running at a lower (e.g.
quarter) speed. This way the worst case of sixteen read requests from the PAE array needs to be
serviced in four cycles of the cache controller, with an additional four read requests from the RISC. So
one bus at full speed can be used to service four IRAM read ports. Using four-way associativity, four
accesses per cycle can be serviced, even when all four accesses go to addresses that map to the same
associative block.

The RISC still has a 16-way set associative view of the cache, accessing all four four-way set
associative banks in parallel. Due to data duplication it is possible that several banks return a hit. This
has to be taken care of with a priority encoder, enabling only one bank onto the data bus.

The RISC is blocked from the banks that service IRAM port accesses. Wait states are inserted
accordingly. The impact of wait states is reduced if the RISC shares the second access port of a
two-port cache with the RAM interface, using the cycles between the RAM transfers for its accesses.

Another problem is that one IRAM read could potentially address the same memory location as a write
from another IRAM; the value read depends on the order of the operations, so the order must be fixed:
all writes have to take place after all reads, but before the reads of the next cycle. This can be relaxed
if the reads and writes actually do not overlap. However, a simple priority scheme for the bus accesses
enforces the correct ordering of the accesses.

The problem of read/write consistency is more severe with data duplication, when only one copy of
the data is actually modified. Therefore modifications are forbidden with data duplication.
2.7.1 Programming Model Changes

Data Interference 

With this design without dedicated IRAMs, it is not possible any more to load input data to the IRAMs
and write the output data to a different IRAM, which is mapped to the same address, thus operating on
the original, unaltered input data during the whole configuration.

As there are no dedicated IRAMs any more, writes directly modify the cache contents, which will be
read by succeeding reads. This changes the programming model significantly. Additional and more
in-depth compiler analyses are necessary accordingly.

2.7.2 Hiding Implementation Details 

The actual number of bits in the destination field of the XppPreloadMultiple instruction is
implementation dependent. It depends on the number of cache banks and their associativity, which are
determined by the clock frequency divisor of the XPP PAE array relative to the cache frequency.
However, the assembler can hide this by translating IRAM ports to cache banks, thus reducing the
number of bits from the number of IRAM ports to the number of banks. For the user it is sufficient to
know that each cache bank services an adjacent set of IRAM ports starting at a power of two. Thus it
is best to use data duplication for adjacent ports, starting with the highest power of two bigger than the
number of read ports to the duplicated area.
3 Program Optimizations



3.1 Code Analysis 

the program. Mc/dcils „ beSd ta r« e S (^S ** - d m ° m °* ""^ ta 

3.1.1 Dataflow Analysis 

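Dataflow analysis examines the flow of scalar values through a program. The information it provides
can be represented by dataflow equations operating on sets. A dataflow equation for an object i, that
can be an instruction or a basic block, states that the data available at the end of the execution of i,
Ex[i], are either produced by i or were alive at the beginning of i, In[i], but were not deleted during
the execution of i, Kill[i].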
These equations can be used to solve several problems like: 

■ the problem of reaching definitions 

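■ the Def-Use and Use-Def chains, describing respectively for a definition all uses that can be
reached from it, and for a use all definitions that can reach it,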

■ the available expressions at a point in the program, 

■ the live variables at a point in the program, 

whose solutions are then used by several compilation phases, analysis, or optimizations. 
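
The general shape of such an equation can be sketched for the reaching-definitions problem. The sketch below assumes the conventional form Ex[i] = Gen[i] ∪ (In[i] − Kill[i]); the name Gen, the Block type, the bit-mask encoding and the function reaching_definitions are illustrative choices and are not taken from the study.

```c
#include <stdint.h>

/* One basic block of a hypothetical control-flow graph. Each bit of a mask
 * stands for one definition (bit 0 = definition d0, bit 1 = d1, ...). */
typedef struct {
    uint32_t gen;        /* definitions produced by the block         */
    uint32_t kill;       /* definitions deleted by the block          */
    uint32_t pred_mask;  /* bit p set: block p is a predecessor       */
} Block;

/* Iterates Ex[i] = Gen[i] | (In[i] & ~Kill[i]) until nothing changes.
 * In[i] is the union of Ex[p] over all predecessors p of block i. */
void reaching_definitions(const Block *blocks, int n,
                          uint32_t *in, uint32_t *ex)
{
    for (int i = 0; i < n; i++) {
        in[i] = 0;
        ex[i] = blocks[i].gen;
    }
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n; i++) {
            uint32_t new_in = 0;
            for (int p = 0; p < n; p++)
                if (blocks[i].pred_mask & (1u << p))
                    new_in |= ex[p];
            uint32_t new_ex = blocks[i].gen | (new_in & ~blocks[i].kill);
            if (new_in != in[i] || new_ex != ex[i]) {
                in[i] = new_in;
                ex[i] = new_ex;
                changed = 1;
            }
        }
    }
}
```

For a diamond-shaped graph in which block 0 and block 2 both define the same variable, both definitions reach the join block, which is exactly the situation a Def-Use chain has to record.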
For instance, the Def-Use chains are used by the data dependence analysis for scalar variables or by
the register allocation. A Def-Use chain is associated with each definition of a variable and is the set
of all visible uses from this definition.

vanable that can be found at the exit of Bl is Ex%n=%%m^A Z J v m Hence 1,16 

Moreover we have Ex(B2)=E X (Bl) as no va^We is d2i nr ™£ ° f 18 Ex ( B4 > = {^4» ■ 
uses of* in and A depend on me deSon of ?£ " 5/ thltth ^ ^V? fmd ** *» 
definitions of x in Bl and 53 The nS n~ I • 5/ ' * at ? e USe of * ln SS de P e »d on the 

D(Sl) = { S2,S3,S5)L D(S4) J£ 5} ^ ""^ ^ defmitions « then 
Figure 7: Control-flow graph of a piece of program



3.1.2 Data Dependence Analysis 

A data dependence graph represents the dependences existing between operations writing or reading
the same data. This graph is used for optimizations like scheduling, or certain loop optimizations, to
test their semantic validity. The nodes of the graph represent the instructions, and the edges represent
the data dependences. These dependences can be of three types: true (or flow) dependence when a
variable is written before being read, anti-dependence when a variable is read before being written,
and output dependence when a variable is written twice. Here is a more formal definition [3].

Definition 

Let S and S' be two statements. Then S' depends on S, noted $S \, \delta \, S'$, iff:

(1) S is executed before S',

(2) $\exists v \in VAR : v \in DEF(S) \cap USE(S') \;\lor\; v \in USE(S) \cap DEF(S') \;\lor\; v \in DEF(S) \cap DEF(S')$, and

(3) there is no statement T such that S is executed before T, T is executed before S', and $v \in DEF(T)$,

where VAR is the set of the variables of the program, DEF(S) is the set of the variables defined by
instruction S, and USE(S) is the set of variables used by instruction S.

Moreover, if the statements are in a loop, a dependence can be loop-independent or loop-carried. This
notion introduces the definition of the distance of a dependence. When a dependence is
loop-independent, it occurs between two instances of different statements in the same iteration,
and its distance is equal to zero. On the contrary, when a dependence occurs between two
instances in two different iterations, the dependence is loop-carried, and the distance is equal to the
difference between the iteration numbers of the two instances.
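
For the simple case of a single loop with increasing index i, in which one statement writes a[i + write_offset] and another reads a[i + read_offset], the distance follows directly from the two offsets. The sketch below covers only this uniform case; the type and function names are illustrative assumptions, not part of the study.

```c
typedef enum {
    DEP_LOOP_INDEPENDENT,  /* both accesses happen in the same iteration     */
    DEP_TRUE_CARRIED,      /* written first, read in a later iteration       */
    DEP_ANTI_CARRIED       /* read first, overwritten in a later iteration   */
} DepKind;

typedef struct {
    DepKind kind;
    int distance;          /* difference between the two iteration numbers   */
} Dependence;

/* The write in iteration i1 and the read in iteration i2 touch the same
 * element when i1 + write_offset == i2 + read_offset, that is when
 * i2 - i1 == write_offset - read_offset. */
Dependence classify(int write_offset, int read_offset)
{
    Dependence d;
    int delta = write_offset - read_offset;
    if (delta == 0) {
        d.kind = DEP_LOOP_INDEPENDENT;
        d.distance = 0;
    } else if (delta > 0) {
        d.kind = DEP_TRUE_CARRIED;
        d.distance = delta;
    } else {
        d.kind = DEP_ANTI_CARRIED;
        d.distance = -delta;
    }
    return d;
}
```

For example, classify(0, 2) models a statement of the form a[i] = a[i+2] + ... and reports an anti-dependence with distance 2.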
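The notion of direction of dependence generalizes the notion of distance, and is generally used when
the distance of a dependence is not constant, or cannot be computed with precision. The direction of a
dependence is given by < if the dependence between S and S' occurs when the instance of S is in an
iteration before the iteration of the instance of S', by = if the two instances are in the same iteration,
and by > if the instance of S is in an iteration after the iteration of the instance of S'. In the case of a
loop nest, distance and direction become vectors with one element per loop level.

The data dependence graph is used by many optimizations, and is also useful to determine whether
their application is valid. For instance, a loop can be vectorized if its data dependence graph does not
contain any cycle.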



    for(i=0; i<N; i=i+1) {
    S:  a[i] = b[i] + 1;
    S1: c[i] = a[i] + 2;
    }

Figure 8: Example of a true dependence with distance 0

    for(i=0; i<N; i=i+1) {
    S:  a[i] = b[i] + 1;
    S1: b[i] = c[i] + 2;
    }

Figure 9: Example of an anti-dependence with distance 0 on array b

    for(i=0; i<N; i=i+1) {
    S:  a[i] = b[i] + 1;
    S1: a[i] = c[i] + 2;
    }

Figure 10: Example of an output dependence with distance 0 on array a
    for(j=0; j<=N; j++)
        for(i=0; i<=N; i++) {
    S1:     c[i][j] = 0;
            for(k=0; k<=N; k++)
    S2:         c[i][j] = c[i][j] + a[i][k]*b[k][j];
        }

Figure 11: Example of a dependence with direction vector (=,=)
between S1 and S2 and a dependence with direction vector (=,=,<)
between S2 and S2.

    for(i=0; i<=N; i++)
        for(j=0; j<=N; j++)
    S:      a[i][j] = a[i][j+2] + b[i];

    S δ^a (0,2)

Figure 12: Example of an anti-dependence with distance vector (0,2).



3.1.3 Interprocedural Alias Analysis 



The aim of alias analysis is to determine whether a memory location is aliased by several objects,
such as variables or arrays, in a program. It has a strong impact on data dependence analysis and on
the application of code optimizations. Aliases can occur:

■ with statically allocated data, like unions in C where all fields refer to the same memory area, or
with dynamically allocated data, which are the usual targets of the analysis, or

■ with pointers referencing static data, like in C.

In Figure 13, we have a typical case of aliasing where p aliases b.

    int b[100], *p;

    for(p=b; p < &b[100]; p++)
        *p = 0;

Figure 13: Example for typical aliasing
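Alias analysis can be more or less precise depending on whether or not it takes the control flow into
account. When it does, it is called flow-sensitive; when it does not, it is called flow-insensitive.
Flow-sensitive alias analysis is able to detect in which blocks along a path two objects are aliased.

Furthermore, aliases are classified into must-aliases and may-aliases. With flow-insensitive may-alias
information, x aliases y iff x and y may, possibly at different times, refer to the same memory
location. With flow-insensitive must-alias information, x aliases y iff x and y must, throughout the
execution of a procedure, refer to the same storage location. In the case of Figure 14, if we consider
flow-insensitive may-alias information, p alias b holds, whereas if we consider flow-insensitive
must-alias information, p alias b does not hold. The kind of information to use depends on the
problem to solve. For instance, to remove redundant expressions or statements, must-aliases have to
be used, whereas to build a data dependence graph may-aliases are necessary.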



    B1:  int *p, b[100];

    B2:  p = b;
         <uses of b and p>
         p = malloc();

    B3:  p = malloc();
         <uses of b and p>

    B4:  <uses of b and p>

Figure 14: Example of control-flow sensitivity

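Finally, this analysis has to be interprocedural to detect aliases caused by non-local variables and by
parameter passing. The latter case is depicted in Figure 15, where i and j are aliased through the
function call in which k is passed twice as a parameter.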

    void foo(int *i, int *j)
    {
        *i = *j + 1;
    }

    foo(&k, &k);



Figure 15: Example for aliasing by parameter passing 

3.1.4 Interprocedural Value Range Analysis 

Value range analysis can use information on the types of variables and then consider the operations
applied to these variables during the execution of the program. Thus it can determine, for instance,
whether tests in conditional instructions are likely to be met or not, or determine the iteration range
of loop nests.
    void foo(int *c, int N)
    {
        int i;

        for(i=0; i<N; i++)
            c[i] = g(i,2);
    }

    if (N > 10)
        foo(a,N);
    else
        foo(b,N);



3.1.5 Alignment Analysis 

Although data memory is logically a linear array of cells, its realization in hardware can be viewed as
a multi-dimensional array. Given a dimension in this array, alignment analysis will identify memory
locations that always resolve to a single value in that dimension.

A tool can automatically generate alignment proposals for the arrays accessed in a procedure and thus
simplify the data distribution problem. This can be extended with an interprocedural analysis.

3.2 Code Optimizations 

Moa of the optima «, Wo^o,,, presented hwe ^ „ e found ^ ^ fa ^ ^ ^ ^ 

3.2.1 General Transformations 

sequel of this doe-m^T * ** aPPe * r m * mm f* r - but *W «™ mentioned in the 

Constant Propagation 

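This optimization propagates the values of constants into the expressions using them. This way part
of the computation can be done statically by the compiler, leaving less work to be done during the
execution.

    /* original */
    N = 256;
    c = 3;
    for(i=0; i<=N; i++)
        a[i] = b[i] + c;

    /* transformed */
    for(i=0; i<=256; i++)
        a[i] = b[i] + 3;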

Figure 16: Example of constant propagation 

Copy Propagation 

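This optimization simplifies the code by removing redundant copies of the same variable. It reduces
the register pressure and the number of register-to-register move instructions.

    /* original */
    r = t;
    for(i=0; i<=N; i++)
        a[r] = b[r] + a[i];

    /* transformed */
    for(i=0; i<=N; i++)
        a[t] = b[t] + a[i];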

Figure 1 7: Example of copy propagation 
Dead Code Elimination

else 

for (j=0; j<10; j++) 
} a[j + l] - a[j] + b[j]; 

Figure 18: Example of dead code elimination 

Forward Substitution 

^ ~ ,™ t for(i=0; i<= N; 

Figure 19: Example of forward substitution 

Idiom Recognition 

a[i] = c; i 

} 

Figure 20: Example of idiom recognition 

3.2.2 Loop Transformations 

Loop Normalization 
This transformation simplifies loop analysis, and it enables the use of dependence tests requiring
normalized loops.

    /* original */
    for(i=2; i<N; i=i+2)
        a[i] = b[i];

    /* normalized */
    for(i=0; i<(N-2)/2; i++)
        a[2*i+2] = b[2*i+2];

Figure 21: Example of loop normalization

Loop Reversal 

This transformation changes the direction in which the iteration space of a loop is scanned. It is
frequently used in conjunction with loop normalization and other transformations, like loop
interchange, because it changes the dependence vectors.

    /* original */
    for(i=N; i>=0; i--)
        a[i] = b[i];

    /* transformed */
    for(i=0; i<=N; i++)
        a[i] = b[i];

Figure 22: Example of loop reversal 

Strength Reduction 

This transformation replaces expressions in the loop body by equivalent but less expensive ones. It
can be used on induction variables, other than the loop variable, to be able to eliminate them.

    /* original */
    for(i=0; i<N; i++)
        a[i] = b[i] + c*i;

    /* transformed: t holds the value of c*i for the current iteration */
    t = 0;
    for(i=0; i<N; i++) {
        a[i] = b[i] + t;
        t = t + c;
    }

Figure 23 : Example of strength reduction 

Induction Variable Elimination 

This transformation can use strength reduction to remove induction variables from a loop, hence
reducing the number of computations and easing the analysis of the loop. This also removes
dependence cycles due to the update of the variable, enabling vectorization.

    /* original */
    for(i=0; i<=N; i++) {
        k = k + 3;
        a[i] = b[i] + a[k];
    }

    /* transformed */
    for(i=0; i<=N; i++) {
        a[i] = b[i] + a[k+(i+1)*3];
    }
    k = k + (N+1)*3;

Figure 24: Example of induction variable elimination 

Loop-Invariant Code Motion 

    /* original */
    for(i=0; i<N; i++)
        a[i] = b[i] + x*y;

    /* transformed */
    if (N >= 0)
        c = x*y;
    for(i=0; i<N; i++)
        a[i] = b[i] + c;

Figure 25: Example of loop-invariant code motion 

Loop Unswitching 

This transformation moves a conditional instruction out of a loop body if its condition is
loop-invariant. The branches of the new condition contain the original loop with the appropriate
statements from the original condition. Loop unswitching allows parallelization of the loop by
removing control-flow code from the loop body.

    /* original */
    for(i=0; i<N; i++) {
        a[i] = b[i] + 3;
        if (x > 2)
            b[i] = c[i] + 2;
        else
            b[i] = c[i] - 2;
    }

    /* transformed */
    if (x > 2)
        for(i=0; i<N; i++) {
            a[i] = b[i] + 3;
            b[i] = c[i] + 2;
        }
    else
        for(i=0; i<N; i++) {
            a[i] = b[i] + 3;
            b[i] = c[i] - 2;
        }

Figure 26: Example of loop unswitching 

If-Conversion 

This transformation is applied to loop bodies with conditional instructions. It changes control 
dependences into data dependences and enables a subsequent vectorization. It can be used in 
conjunction with loop unswitching to handle loop bodies with several basic blocks. The conditions 
where array expressions could appear, are replaced by boolean terms called guards. Processors with 
predicated execution support can directly execute such code, and configurable hardware can use the 
result of guards to direct dataflow through different branches by means of multiplexers and 
demultiplexers.

    /* original */
    for(i=0; i<N; i++) {
        a[i] = a[i] + b[i];
        if (a[i] != 0)
            if (a[i] > c[i])
                a[i] = a[i] - 2;
            else
                a[i] = a[i] + 1;
        d[i] = a[i] * 2;
    }

    /* transformed */
    for(i=0; i<N; i++) {
        a[i] = a[i] + b[i];
        c2 = (a[i] != 0);
        if (c2) c4 = (a[i] > c[i]);
        if (c2 && c4) a[i] = a[i] - 2;
        if (c2 && !c4) a[i] = a[i] + 1;
        d[i] = a[i] * 2;
    }



Figure 27: Example of if-conversion 
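
The effect of the guards can also be checked in plain C. The sketch below places the loop of Figure 27 next to a variant in which both candidate results are computed and a select, standing in for a multiplexer, picks one of them. The function names are illustrative, and evaluating the inner comparison unconditionally is an assumption that is only safe because the comparison has no side effects.

```c
/* Loop body with nested control flow, as in the original form. */
void loop_with_branches(int *a, const int *b, const int *c, int *d, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        if (a[i] != 0) {
            if (a[i] > c[i])
                a[i] = a[i] - 2;
            else
                a[i] = a[i] + 1;
        }
        d[i] = a[i] * 2;
    }
}

/* If-converted form: the conditions become data (guards c2 and c4), and the
 * conditional expressions model multiplexers choosing among the results. */
void loop_if_converted(int *a, const int *b, const int *c, int *d, int n)
{
    for (int i = 0; i < n; i++) {
        int v = a[i] + b[i];
        int c2 = (v != 0);        /* outer guard                      */
        int c4 = (v > c[i]);      /* inner guard, always evaluated    */
        int then_v = v - 2;       /* result of the "then" branch      */
        int else_v = v + 1;       /* result of the "else" branch      */
        v = (c2 && c4) ? then_v : ((c2 && !c4) ? else_v : v);
        a[i] = v;
        d[i] = v * 2;
    }
}
```

Both functions produce the same arrays a and d for the same inputs; the second one contains no control dependence inside the loop body.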



Strip-Mining 

This transformation makes it possible to adjust the granularity of an operation. It is commonly used to
choose the number of independent computations in the inner loop nest. When the iteration count is not
known at compile time, it can be used to generate a fixed iteration count inner loop satisfying the
resource constraints. It can be used in conjunction with other transformations like loop distribution or
loop interchange. It is also called loop sectioning. Cycle shrinking, also called stripping, is a
specialization of strip-mining.

    /* original */
    for(i=0; i<N; i++)
        a[i] = b[i] + c;

    /* transformed */
    up = (N/16)*16;
    for(i=0; i<up; i=i+16)
        for(j=i; j<i+16; j++)
            a[j] = b[j] + c;
    for(j=up; j<N; j++)
        a[j] = b[j] + c;

Figure 28: Example of strip-mining 
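
A self-contained version of the strip-mined loop is sketched below, mainly to make the loop bounds explicit: the inner loop always runs exactly STRIP iterations, and a remainder loop handles the last N mod STRIP elements. The function name and the STRIP constant are illustrative assumptions.

```c
#define STRIP 16

/* Computes a[j] = b[j] + c for 0 <= j < n using fixed-length strips. */
void add_scalar_strip_mined(int *a, const int *b, int c, int n)
{
    int up = (n / STRIP) * STRIP;       /* largest multiple of STRIP <= n */

    for (int i = 0; i < up; i += STRIP)
        for (int j = i; j < i + STRIP; j++)   /* exactly STRIP iterations */
            a[j] = b[j] + c;

    for (int j = up; j < n; j++)              /* remainder iterations     */
        a[j] = b[j] + c;
}
```

The fixed trip count of the inner loop is what allows it to be mapped onto a resource of known size, for example a configuration that processes one strip at a time.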



Loop Tiling 

This transformation modifies the iteration space of a loop nest by introducing loop levels to divide the 
iteration space in tiles. It is a multi-dimensional generalization of strip-mining. It is generally used to 
improve memory reuse, but can also improve processor, register, translation-lookaside buffer (TLB) 
or page locality. It is also called loop blocking. 

The size of the tiles of the iteration space is chosen such that the data needed in each tile fits into the 
cache memory, thus reducing the cache misses. In the case of coarse-grain computers, the size of the 
tiles can also be chosen such that the number of parallel operations of the loop body matches the 
number of processors of the computer. 

    /* original */
    for(i=0; i<N; i++)
        for(j=0; j<N; j++)
            a[i][j] = b[j][i];

    /* transformed */
    for(ii=0; ii<N; ii=ii+16)
        for(jj=0; jj<N; jj=jj+16)
            for(i=ii; i<min(ii+16,N); i++)
                for(j=jj; j<min(jj+16,N); j++)
                    a[i][j] = b[j][i];

Figure 29: Example of loop tiling 
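
The tiled loop nest can be written as a small self-contained routine. The sketch below assumes square matrices stored row-major in flat arrays and a tile size of 16; the names transpose_tiled, min_int and TILE are illustrative and not taken from the study.

```c
#define TILE 16

static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled transpose: a[i][j] = b[j][i] for an n x n matrix in flat storage.
 * The two outer loops walk over tiles, the two inner loops stay inside one
 * tile, so each tile of a and b is reused while it is still in the cache. */
void transpose_tiled(int *a, const int *b, int n)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < min_int(ii + TILE, n); i++)
                for (int j = jj; j < min_int(jj + TILE, n); j++)
                    a[i * n + j] = b[j * n + i];
}
```

The min_int bound handles matrices whose size is not a multiple of the tile size, so the last tile in each dimension may be smaller than the others.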

Loop Interchange 

This transformation interchanges loop levels of a nest in order to change data dependences. It can: 

■ enable vectorization by interchanging an independent loop with a dependent loop, or 

■ improve vectorization by pushing the independent loop with the largest range further inside, or 

■ reduce the stride, or

■ increase the number of loop-invariant expressions in the inner-loop, or 

■ improve parallel performance by moving an independent loop outside of a loop nest to increase the 
granularity of each iteration and reduce the number of barrier synchronizations. 

    /* original */
    for(i=0; i<N; i++)
        for(j=0; j<N; j++)
            a[i] = a[i] + b[i][j];

    /* transformed */
    for(j=0; j<N; j++)
        for(i=0; i<N; i++)
            a[i] = a[i] + b[i][j];

Figure 30: Example of loop interchange 



Loop Coalescing / Collapsing 

This transformation combines a loop nest into a single loop. It can improve the scheduling of the loop 
and also reduces the loop overhead. Collapsing is a simpler version of coalescing in which the number 
of dimensions of arrays is reduced as well. Collapsing reduces the overhead of nested loops and
multi-dimensional arrays. Collapsing can be applied to loop nests that iterate over memory with a
constant stride.

    /* original */
    for(i=0; i<N; i++)
        for(j=0; j<M; j++)
            a[i][j] = a[i][j] + c;

    /* transformed: i and j are recovered from the single index k */
    for(k=0; k<N*M; k++) {
        i = k / M;
        j = k % M;
        a[i][j] = a[i][j] + c;
    }



Figure 31: Example of loop coalescing 
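
The index recovery used by coalescing can be sketched as a small routine. It assumes zero-based indices, a row-major N x M array in flat storage and M > 0; the function name is illustrative.

```c
/* Adds c to every element of an n x m array stored row-major in a.
 * The single loop index k is split back into the row i and column j
 * that the original two-level loop nest would have used. */
void add_scalar_coalesced(int *a, int n, int m, int c)
{
    for (int k = 0; k < n * m; k++) {
        int i = k / m;      /* row index of the original outer loop    */
        int j = k % m;      /* column index of the original inner loop */
        a[i * m + j] += c;
    }
}
```

Since i * m + j is equal to k here, a collapsing compiler can go one step further and address the array with k directly, which removes the division and the modulo operation.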

Loop Fusion 

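This transformation merges two successive loops. It reduces the loop overhead, can improve register,
cache, TLB or page locality, and improves the load balance of parallel loops. Alignment of the loops
can be taken into account by introducing conditional instructions to take care of dependences.

    /* original */
    for(i=0; i<N; i++)
        a[i] = b[i] + c;
    for(i=0; i<N; i++)
        d[i] = e[i] + c;

    /* transformed */
    for(i=0; i<N; i++) {
        a[i] = b[i] + c;
        d[i] = e[i] + c;
    }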



Figure 32: Example of loop fusion 



Loop Distribution 



    /* original */
    for(i=0; i<N; i++) {
        a[i] = b[i] + c;
        d[i] = e[i] + c;
    }

    /* transformed */
    for(i=0; i<N; i++)
        a[i] = b[i] + c;
    for(i=0; i<N; i++)
        d[i] = e[i] + c;



Figure 33: Example of loop distribution 



Loop Unrolling / Unroll-and-Jam 



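This transformation replicates the original loop body in order to get a larger one. A loop can be
unrolled partially or completely. It is used to get more opportunities for parallelization by making the
inner loop body larger, and it reduces the loop overhead. Unrolling an outer loop followed by merging
the resulting inner loops is referred to as unroll-and-jam.

    /* original */
    for(i=0; i<N; i++)
        a[i] = b[i] + c;

    /* transformed: unrolled by 2, with one leftover iteration when N is odd */
    for(i=0; i<N-1; i=i+2) {
        a[i] = b[i] + c;
        a[i+1] = b[i+1] + c;
    }
    if ((N % 2) == 1)
        a[N-1] = b[N-1] + c;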



Figure 34: Example of loop unrolling 
Loop Alignment

This optimization transforms the code to achieve aligned array accesses in the loop body. The
application of loop alignment transforms loop-carried dependences into loop-independent
dependences, which allows extracting more parallelism from a loop. It uses a combination of other
transformations like loop peeling or introduces conditional statements. Loop alignment can be used in
conjunction with loop fusion to align the array accesses in both loop nests. In the example below all
accesses to array a become aligned.

    /* original */
    for(i=2; i<=N; i++) {
        a[i] = b[i] + c[i];
        d[i] = a[i-1] * 2;
        e[i] = a[i-1] + d[i+1];
    }

    /* transformed */
    for(i=1; i<=N; i++) {
        if (i>1) a[i] = b[i] + c[i];
        if (i<N) d[i+1] = a[i] * 2;
        if (i<N) e[i+1] = a[i] + d[i+2];
    }

Figure 35: Example of loop alignment 

Loop Skewing 

This transformation is used to enable parallelization of a loop nest. It is useful in combination with 
loop interchange. It is performed by adding the outer loop index multiplied by a skew factor f to the
bounds of the inner loop variable, and then subtracting the same quantity from every use of the inner
loop variable inside the loop.

    /* original */
    for(i=1; i<=N; i++)
        for(j=1; j<=N; j++)
            a[i] = a[i+j] + c;

    /* transformed (skew factor 1) */
    for(i=1; i<=N; i++)
        for(j=i+1; j<=i+N; j++)
            a[i] = a[j] + c;

Figure 36: Example of loop skewing 

Loop Peeling 

This transformation removes a small number of starting or closing iterations of a loop to avoid 
dependences in the loop body. These removed iterations are executed separately. It can be used for 
matching the iteration control of adjacent loops to enable loop fusion.

    /* original */
    for(i=0; i<=N; i++)
        a[i][N] = a[0][N] + a[N][N];

    /* transformed */
    a[0][N] = a[0][N] + a[N][N];
    for(i=1; i<=N-1; i++)
        a[i][N] = a[0][N] + a[N][N];
    a[N][N] = a[0][N] + a[N][N];

Figure 37: Example of loop peeling 

Loop Splitting 

This transformation cuts the iteration space in pieces by creating other loop nests. It is also called
index set splitting and is generally used because of dependences that prevent parallelization. It can be
seen as a generalization of loop peeling.

    /* original */
    for(i=0; i<=N; i++)
        a[i] = a[N-i+1] + c;

    /* transformed */
    for(i=0; i<(N+1)/2; i++)
        a[i] = a[N-i+1] + c;
    for(i=(N+1)/2; i<=N; i++)
        a[i] = a[N-i+1] + c;

Figure 38: Example of loop splitting 
Node Splitting

This transformation splits a statement in pieces. It is used to break dependence cycles in the 
dependence graph due to the too high granularity of the nodes, thus enabling vectorization of the 
statements. 

    /* original */
    for(i=0; i<N; i++) {
        b[i] = a[i] + c[i] * d[i];
        a[i+1] = b[i] * (d[i] - c[i]);
    }

    /* transformed */
    for(i=0; i<N; i++) {
        t1[i] = c[i] * d[i];
        t2[i] = d[i] - c[i];
        b[i] = a[i] + t1[i];
        a[i+1] = b[i] * t2[i];
    }

Figure 39: Example of node splitting 

Scalar Expansion 

This transformation replaces a scalar in a loop by an array to eliminate dependences in the loop body
and enables parallelization of the loop nest. If the scalar is used after the loop, compensation code
must be added.

    /* original */
    for(i=0; i<N; i++) {
        c = b[i];
        a[i] = a[i] + c;
    }

    /* transformed */
    for(i=0; i<N; i++) {
        tmp[i] = b[i];
        a[i] = a[i] + tmp[i];
    }
    c = tmp[N-1];

Figure 40: Example of scalar expansion 

Array Contraction / Array Shrinking 

This transformation is the reverse transformation of scalar expansion. It may be needed if scalar 
expansion generates too many memory requirements. 

    /* original */
    for(i=0; i<N; i++)
        for(j=0; j<N; j++) {
            t[i][j] = a[i][j] * 3;
            b[i][j] = t[i][j] + c[j];
        }

    /* transformed */
    for(i=0; i<N; i++)
        for(j=0; j<N; j++) {
            t[j] = a[i][j] * 3;
            b[i][j] = t[j] + c[j];
        }

Figure 41: Example of array contraction 

Scalar Replacement 

This transformation replaces an invariant array reference in a loop by a scalar. This array element is 
loaded in a scalar before the inner loop and stored again after the inner loop, if it is modified. It can
be used in conjunction with loop interchange.

    /* original */
    for(i=0; i<N; i++)
        for(j=0; j<N; j++)
            a[i] = a[i] + b[i][j];

    /* transformed */
    for(i=0; i<N; i++) {
        tmp = a[i];
        for(j=0; j<N; j++)
            tmp = tmp + b[i][j];
        a[i] = tmp;
    }

Figure 42: Example of scalar replacement 
Reduction Recognition

This transformation handles reductions in loops. A reduction is an operation that computes a scalar
value from arrays. It can be a dot product, or the sum or minimum of a vector, for instance. The goal
is then to perform as many operations in parallel as possible. One way is to accumulate a vector
register of partial results and then reduce it to a scalar with a sequential loop. Maximum parallelism
is achieved by reducing the vector register with a tree: pairs of elements are summed, then pairs of
these results are summed, etc.

    /* original */
    for(i=0; i<N; i++)
        s = s + a[i];

    /* transformed */
    for(i=0; i<N; i=i+64)
        tmp[0:63] = tmp[0:63] + a[i:i+63];
    for(i=0; i<64; i++)
        s = s + tmp[i];

Figure 43: Example of reduction recognition 
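
The scheme of partial results can be emulated in scalar C. The sketch below uses 64 partial sums as in the figure, models each vector addition by an inner loop over the lanes, and adds a remainder loop for vector lengths that are not a multiple of 64; the remainder handling and all names are assumptions made for illustration.

```c
#define LANES 64

/* Sums a[0..n-1] by first accumulating LANES independent partial sums
 * and then reducing them sequentially. */
long sum_with_partials(const int *a, int n)
{
    long tmp[LANES] = {0};
    int up = (n / LANES) * LANES;       /* part covered by full vectors */

    for (int i = 0; i < up; i += LANES)
        for (int l = 0; l < LANES; l++) /* stands for one vector add    */
            tmp[l] += a[i + l];

    long s = 0;
    for (int l = 0; l < LANES; l++)     /* sequential final reduction   */
        s += tmp[l];
    for (int i = up; i < n; i++)        /* leftover elements            */
        s += a[i];
    return s;
}
```

The additions into tmp are independent of each other, which is what makes the first loop vectorizable. For floating-point data the reordering of additions can change the rounding of the result, so a compiler would need permission to reassociate.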

Loop Pushing / Loop Embedding 

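This transformation replaces a call in a loop body by the loop in the called function. It is an
inter-procedural optimization. It allows the parallelization of the loop nest and eliminates the
overhead caused by the procedure call.

    /* original */
    for(i=0; i<N; i++)
        f(x,i);

    void f(int* a, int j) {
        a[j] = a[j] + c;
    }

    /* transformed */
    f2(x);

    void f2(int* a) {
        for(i=0; i<N; i++)
            a[i] = a[i] + c;
    }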

Figure 44: Example of loop pushing 

Procedure Inlining 

This transformation replaces a call to a procedure by the code of the procedure itself. It is an
inter-procedural optimization. It allows a loop nest to be parallelized, removes overhead caused by
the procedure call, and can improve locality.

    /* original */
    for(i=0; i<N; i++)
        f(a,i);

    void f(int* x, int j) {
        x[j] = x[j] + c;
    }

    /* transformed */
    for(i=0; i<N; i++)
        a[i] = a[i] + c;

Figure 45: Example of procedure inlining

Statement Reordering 

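This transformation schedules the instructions of the loop body to modify the data dependence graph
and enable vectorization.

    /* original */
    for(i=0; i<N; i++) {
        a[i] = b[i] * 2;
        c[i] = a[i-1] - 4;
    }

    /* transformed */
    for(i=0; i<N; i++) {
        c[i] = a[i-1] - 4;
        a[i] = b[i] * 2;
    }

Figure 46: Example of statement reordering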
Software Pipelining

This transformation parallelizes a loop body by scheduling instructions of different instances of the
loop body. It is a powerful optimization to improve instruction-level parallelism. It can be used in
conjunction with loop unrolling. In the example below, the preload commands can be issued one after
another, each taking only one cycle. This time is just enough to request the memory areas. It is not
enough to actually load them. This takes many cycles, depending on the cache level that actually has
the data. Execution of a configuration behaves similarly. The configuration is issued in a single cycle,
waiting until all data are present. Then the configuration executes for many cycles. Software pipelining
overlaps the execution of a configuration with the preloads for the next configuration. This way the
XPP array can be kept busy in parallel to the Load/Store unit.

    /* original */
    XppPreloadConfig(CFG1);
    for (i=0; i<100; ++i) {
        XppPreload(2,a+10*i,10);
        XppPreload(5,b+20*i,20);
        // delay
        XppExecute();
    }

    /* transformed */
    /* Prologue */
    XppPreloadConfig(CFG1);
    XppPreload(2,a,10);
    XppPreload(5,b,20);
    // delay

    /* Kernel */
    for (i=1; i<100; ++i) {
        XppExecute();
        XppPreload(2,a+10*i,10);
        XppPreload(5,b+20*i,20);
    }

    /* Epilog */
    XppExecute();
    // delay


Figure 47: Example of software pipelining 

Vector Statement Generation 

This transformation replaces instructions by vector instructions that can perform an operation on 
several data in parallel. This occurs at the end of the vectorization process, and is only of interest if the 
target processor is a vector processor. 

    /* original */
    for(i=0; i<=N; i++)
        a[i] = b[i];

    /* transformed */
    a[0:N] = b[0:N];

Figure 48: Example of vector statement generation 

3.2.3 Data-Layout Optimizations 

In the following we describe optimizations that modify the data layout in memory in order to extract 
more parallelism or prevent memory problems like cache misses. 
Scalar Privatization

This optimization is used in multi-processor systems to increase the amount of parallelism and avoid
unnecessary communications between the processing elements. If a scalar is only used like a
temporary variable in a loop body, then each processing element can receive a copy of it and achieve
its computations with this private copy.

for(i=0;i <= N;i++) { 
c = b[i] ; 
a[i] = a[i] + c; 

} 

Figure 49: Example for scalar privatization 

Array Privatization 

This optimization is the same as scalar privatization except that it works on arrays rather than on
scalars.

Array Merging 

This optimization transforms the data layout of arrays by merging the data of several arrays following
the way they are accessed in a loop nest. This way, memory cache misses can be avoided. The layout
of the arrays can be different for each loop nest. In the example below, the accesses to array a are
interleaved with accesses to array b. In the merged layout of both arrays, blocks of a are interleaved
with blocks of b. Unnecessary cache misses are avoided as blocks of arrays a and b are always
together in the cache when getting data from memory. Details may be found in [11].

    for(i=1; i<=N-1; i++)
        for(j=1; j<=N; j++)
            b[i][j] = 0.25*(a[i-1][j] + a[i][j-1] +
                            a[i+1][j] + a[i][j+1]);


Figure 50: Example for array merging 

3.2.4 Example of Application of the Optimizations 

A lot of optimizations can be performed on loops before and also after generation of vector statements.
Finding a sequence of optimizations producing an optimal solution for all loop nests of a program is
still an area of research. Therefore we propose a way to use them that follows a reasonable heuristic
to produce vectorizable loop nests. To vectorize the code we can use the Allen-Kennedy algorithm,
which uses statement reordering and loop distribution before vector statements are generated. It can
be enhanced with loop interchange, scalar expansion, index set splitting, node splitting and loop
peeling. All these transformations are based on the data dependence graph. A statement can be
vectorized if it is not part of a dependence cycle, hence optimizations are performed to break cycles
or, if not completely possible, to create loop nests without dependence cycles. The example presented
below is intended as an illustration for the use of the optimizations presented before.

The whole process is divided into four major steps. First the procedures are restructured by analyzing
the procedure calls inside the loop bodies and trying to remove them. Then some high-level dataflow
optimizations are applied to the loop bodies to modify their control-flow and simplify their code. The
third step prepares the loop nests for vectorization by building perfect loop nests and ensures that inner
loop levels are vectorizable. Then target-specific optimizations are applied which optimize the data
locality. Other optimizations and code transformations may be applied between the different steps.

The first step comprises procedure inlining and loop pushing to remove the procedure calls of the loop
bodies. The second step consists of loop-invariant code motion, loop unswitching, strength reduction
and idiom recognition. The third step can be divided in several subsets of optimizations. We first apply
loop reversal, loop normalization and if-conversion to obtain normalized loop nests. This allows
building the data dependence graph. If dependences prevent the loop nest from being vectorized,
adequate transformations are applied. If, for instance, dependences occur only on certain iterations,
loop peeling or loop splitting can remove these dependences. Node splitting, loop skewing, scalar
expansion or statement reordering can be applied in other cases. Loop interchange moves inwards the
loop levels without dependence cycles. The goal is to obtain perfectly nested loops with the loop
levels carrying dependence cycles as much outwards as possible. We subsequently apply loop fusion,
reduction recognition, scalar replacement/array contraction and loop distribution to further improve the
following vectorization. Vector statement generation is performed last, using the Allen-Kennedy
algorithm for instance. The last step can consist of optimizations like loop tiling, strip-mining, loop
unrolling and software pipelining which take the target processor into account.

The number of optimizations in the third step is large, but not all of them are applied to each loop
nest. Following the goal of the vectorization and the data dependence graph, only some of them are
applied. Heuristics are used to guide the application of the optimizations, which can be applied
several times if needed. Let us illustrate this with an example.

    void f(int** a, int** b, int i, int j) {
        a[i][j] = a[i][j-1] - b[i+1][j-1];
    }

    void g(int* a, int* c, int i) {
        a[i] = c[i] + 2;
    }

    for(i=0; i<N; i++) {
        for(j=1; j<9; j++)
            if (k>0)
                f(a,b,i,j);
            else
                g(d,c,j);
        d[i] = d[i+1] + 2;
    }

    for(i=0; i<N; i++)
        a[i][i] = b[i] + 3;



The first step will find that inlining the two procedure calls is possible, then loop unswitching is
applied to remove the conditional instruction of the loop body. The second step starts with applying
loop normalization and analyses the data dependence graph. A cycle can be broken by applying loop
interchange as it is only carried by the second level. The two levels are exchanged, so that the inner
level is vectorizable. Before that or also after, we apply loop distribution. Loop fusion is applied when
the loop on i is pulled out of the conditional instruction by a traditional redundant code elimination
optimization. Finally vector code is generated for the resulting loops.

So in more details, after procedure inlining, we obtain: 
    for(i=0; i<N; i++) {
        for(j=1; j<9; j++)
            if (k>0)
                a[i][j] = a[i][j-1] - b[i+1][j-1];
            else
                d[j] = c[j] + 2;
        d[i] = d[i+1] + 2;
    }

    for(i=0; i<N; i++)
        a[i][i] = b[i] + 3;

After loop unswitching, we obtain: 

    if (k > 0)
        for(i=0; i<N; i++) {
            for(j=1; j<9; j++)
                a[i][j] = a[i][j-1] - b[i+1][j-1];
            d[i] = d[i+1] + 2;
        }
    else
        for(i=0; i<N; i++) {
            for(j=1; j<9; j++)
                d[j] = c[j] + 2;
            d[i] = d[i+1] + 2;
        }

    for(i=0; i<N; i++)
        a[i][i] = b[i] + 3;

After loop normalization, we obtain: 

    if (k > 0)
        for(i=0; i<N; i++) {
            for(j=0; j<8; j++)
                a[i][j+1] = a[i][j] - b[i+1][j];
            d[i] = d[i+1] + 2;
        }
    else
        for(i=0; i<N; i++) {
            for(j=0; j<8; j++)
                d[j+1] = c[j+1] + 2;
            d[i] = d[i+1] + 2;
        }

    for(i=0; i<N; i++)
        a[i][i] = b[i] + 3;

After loop distribution and loop fusion, we obtain: 

    if (k > 0)
        for(i=0; i<N; i++)
            for(j=0; j<8; j++)
                a[i][j+1] = a[i][j] - b[i+1][j];
    else
        for(i=0; i<N; i++)
            for(j=0; j<8; j++)
                d[j+1] = c[j+1] + 2;

    for(i=0; i<N; i++) {
        d[i] = d[i+1] + 2;
        a[i][i] = b[i] + 3;
    }
After loop interchange, we obtain:

    if (k > 0)
        for(j=0; j<8; j++)
            for(i=0; i<N; i++)
                a[i][j+1] = a[i][j] - b[i+1][j];
    else
        for(i=0; i<N; i++)
            for(j=0; j<8; j++)
                d[j+1] = c[j+1] + 2;

    for(i=0; i<N; i++) {
        d[i] = d[i+1] + 2;
        a[i][i] = b[i] + 3;
    }

After vector code generation, we obtain:

    if (k > 0)
        for(j=0; j<8; j++)
            a[0:N-1][j+1] = a[0:N-1][j] - b[1:N][j];
    else
        for(i=0; i<N; i++)
            d[1:8] = c[1:8] + 2;

    d[0:N-1] = d[1:N] + 2;
    a[0:N-1][0:N-1] = b[0:N-1] + 3;
4 Compiler Specification for the PACT XPP



4.1 Introduction 



A cached RISC-XPP architecture exploits its full potential on code that is characterized by high data 
locality and high computational effort. A compiler for this architecture has to consider these design 
constraints. The compiler's primary objective is to concentrate computationally expensive calculations
in innermost loops and to create as much data locality as possible for them.

The compiler contains the usual analyses and optimizations. As interprocedural analyses, like alias
analysis, are especially useful, a global optimization driver is necessary to ensure the propagation of
global information to all optimizations. The following sections concentrate on the way the PACT XPP 
influences the compiler. 



4.2 Compiler Structure 

Figure 51 shows the main steps the compiler must follow to produce code for a system containing a 
RISC processor and a PACT XPP. The next sections focus on the XPP compiler itself, but first the 
other steps are briefly described. 
[Figure: block diagram with the stages Code Preparation, Partitioning, XPP Compiler, RISC Code Gen.
and RISC Code Sched.]




Figure 51: Global View of the Compiling Process 



4.2.1 Code Preparation

This step takes the whole program as input and can be considered as a usual compiler front-end. It
will prepare the code by applying code analysis and optimizations to enable the compiler to extract as
many loop nests as possible to be executed by the PACT XPP. Important optimizations are idiom
recognition, copy propagation, dead code elimination, and all usual analyses like dataflow and alias
analysis.



Handling of Pointer and Array Accesses 

Pointer and array accesses are represented identically in the intermediate code representation which is 
built during the parsing of the source program. Hence pointer accesses are considered like array 
accesses in the data dependence analysis as well as in the optimizations used to transform the loop 
bodies. In the code shown below, for instance, interprocedural alias analysis leads to the decision
that the two pointers p and q never reference the same memory area, and that the loop body may be
successfully handled by the XPP rather than by the host processor.
int foo(int *p, int *q, int N)
{
    int i;

    for(i = 0; i < N; i++)
    {
        p[i] = q[i] * q[i+1];
    }

    return p[N-1];
}

main()
{
    int a[100], b[100];
    int N;

    foo(a, b, N);
}

Figure 52: Example of pointer disambiguation

4.2.2 Partitioning

Partitioning decides which part of the program is executed by the host processor and which part is
executed by the XPP.

A loop nest is executed by the host in three cases: 

■ if the loop nest is not well-formed, 

■ if the number of operations is too small to justify execution on the PACT XPP, or

■ if it is impossible to get a mapping of the loop nest on the PACT XPP. 

A loop nest is said to be well-formed if the loop bounds are computable and the step of all loops is
constant, the loop induction variables are known, and there is only one entry and one exit to the loop.

If the loop bounds are constant but unknown at compile time, it is possible to speculatively generate
XPP code which assumes adequate iteration counts (loop tiling). But small loop iteration counts at run
time can drive the generated XPP code towards inefficiency. One possible solution is the introduction
of a conditional instruction testing whether the loop bounds are large enough for profitable XPP code.
Two versions of the loop nest are produced: one for execution on the host processor, and the other for
execution on the XPP. This concept also eases the application of loop transformations needing
minimal iteration counts.
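
The two-version scheme described above can be sketched in C as follows. This is only an illustration: the threshold XPP_MIN_ITERATIONS and the function names are hypothetical, and the "XPP version" is ordinary C standing in for a configuration so that the sketch runs on any host.

```c
#include <stddef.h>

/* Hypothetical break-even iteration count (an assumption of this sketch).
   A real compiler would derive it from a cost model. */
#define XPP_MIN_ITERATIONS 64

/* Stand-in for the loop version compiled for the XPP. */
static void scale_xpp(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = 2 * src[i];
}

/* Loop version compiled for the host processor. */
static void scale_host(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = 2 * src[i];
}

/* Conditional instruction inserted by the compiler: pick the version at
   run time. Returns 1 if the XPP version ran, 0 if the host version ran. */
int scale(int *dst, const int *src, size_t n)
{
    if (n >= XPP_MIN_ITERATIONS) {
        scale_xpp(dst, src, n);
        return 1;
    }
    scale_host(dst, src, n);
    return 0;
}
```

Both versions compute the same result; only the run-time iteration count decides which one executes.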

4.2.3 RISC Code Generation and Scheduling 

After the XPP compiler has produced NML code for the loops chosen by the partitioning phase, the
main compiling process must handle the code that will be executed by the host processor, where
instructions to manage the configurations have been inserted. This is the objective of the last two
steps:

■ RISC Code Generation and 

■ RISC Code Scheduling. 

The first one produces code for the host processor and the second one optimizes it further by
looking for a better schedule, using software pipelining for instance.
4.3 XPP Compiler for Loops

Figure 53 describes the internal processing of the XPP Compiler. It is a complex cooperation between
program transformations included in the XPP Loop Optimizations, a temporal partitioning phase,
NML code generation and the mapping of the configuration on the PACT XPP.



[Figure 53 flowchart; labels include Temporal Partitioning, yes, exit]

Figure 53: Detailed Architecture of the XPP Compiler



First, the XPP Loop Optimizations are applied to produce innermost loop bodies that can be
executed on the array of processors. In case of success, the NML code generation phase is called;
otherwise temporal partitioning is applied to obtain several configurations for one loop.

After NML code generation and the mapping step, it is possible that a configuration will not fit onto
the PACT XPP array. In this case the loop optimizations are applied again with respect to the reasons
of failure of the NML code generation or of the mapping. If this new application of loop optimizations
does not change the code, temporal partitioning is applied. Furthermore we keep track of the number of
attempts for the NML Code Generation and the mapping. If too many attempts are made, and we still do
not obtain a solution, we break the process, and the loop nest will be executed by the host processor.

4.3.1 Temporal Partitioning 

Temporal partitioning splits the code generated for the XPP into several configurations if the number
of operations, i.e. the size of the configuration, exceeds the number of operations executable in a
single configuration. This transformation is called loop dissevering. These configurations are then
integrated in a loop of configurations whose number of executions corresponds to the iteration range of
the original loop.
4.3.2 Generation of NML Code

This step takes as input an intermediate form of the code produced by the XPP Loop Optimizations
step, together with a dataflow graph built upon it. NML code is then produced by using code
generation techniques [12, 13]. After that, specific optimizations are applied. For instance,
optimizations such as boolean simplification, dedicated to optimizing the generated event networks,
are invoked.

4.3.3 Mapping Step 

This step takes care of mapping the NML modules on the PACT XPP by placing the operations on the ALUs,
FREGs, and BREGs, and routing the data through the buses.

4.4 XPP Loop Optimizations Driver 

The objective of the loop optimizations used for the XPP is to extract as much parallelism as possible
and to avoid memory bottlenecks by means of IRAM usage. The following sections explain how the
optimizations are organized and how the architecture is taken into account when applying them.

4.4.1 Organization of the System 

Figure 54 presents the organization of the loop optimizations. The transformations are divided into
seven groups. Other standard optimizations and analyses are applied in-between. Each group is called
several times. Loops over several groups may also occur. The number of iterations for each driver loop
is constant or determined at compile time by the optimizations themselves (e.g. repeat until a certain
code quality is reached). In the first iteration of the loop, it can be checked whether loop nests are
usable for the XPP; this iteration is mainly directed at checking the loop bounds etc. For instance, if
the loop nest is well-formed and the data dependence graph does not prevent optimization, but the loop
bounds are unknown, then in the first iteration loop tiling is applied to get an innermost loop that is
easier to handle and can be better optimized, and in the second iteration, loop normalization,
if-conversion, loop interchange and other optimizations are applied to effectively optimize the loop
nest for the XPP.

Group I ensures that no procedure calls occur in the loop nest. Group II prepares the loop bodies by
removing loop-invariant instructions and conditional instructions to ease the analysis. Group III
generates loop nests suitable for the data dependence analysis. Group IV contains optimizations to
transform the loop nests so as to obtain data dependence graphs that are suitable for vectorization.
Group V contains optimizations ensuring that innermost loops can be executed on the XPP. Group VI
contains optimizations that further extract parallelism from loop bodies. Group VII contains
optimizations specific to the target hardware.

In each group the application of the optimizations depends on the result of the analysis and the
characteristics of the loop nest. Hence, for instance, the application of a transformation of Group IV
depends on the data dependence graph computed before.
Group I:    procedure inlining, loop pushing
Group II:   loop-invariant code motion, loop unswitching, strength reduction,
            induction variable elimination
Group III:  loop reversal, loop normalization, if-conversion
Group IV:   loop peeling, loop splitting, node splitting, loop skewing, scalar expansion,
            statement reordering
Group V:    loop interchange, loop distribution, loop collapsing, loop tiling, strip-mining,
            loop alignment
Group VI:   loop fusion, reduction recognition, scalar replacement, loop unrolling/unroll&jam
Group VII:  data duplication, shift register synthesis, loop pipelining, tree balancing

Figure 54: Detailed View of the XPP Loop Optimizations



4.4.2 Loop Preparation 



The optimizations of Groups I, II and III of the XPP compiler generate loop bodies without procedure
calls, conditional instructions and induction variables other than loop control variables. Thus loop
nests, where the innermost loops are suitable for execution on the XPP, are obtained. The iteration
ranges are normalized to ease data dependence analysis and the application of other code
transformations.
4.4.3 Transformation of the Data Dependence Graph

The optimizations of Group IV are performed to obtain innermost loops suitable for vectorization with
respect to the data dependence graph. Nevertheless a difference with usual vectorization is that a

b~p"^?2ir <1 *" diaribu,ion - ™ s ™^ — whioh^fMS 

4.4.4 Influence of the Architectural Parameters

Sta ™£T l0 ° P Unr0 " ing - loop"^,^ 

The table below lists the parameters that influence the application of the optimizations. For each of
them two values are given: a starting value computed from the loop, and a restricting value, which the
parameter should reach or should not exceed after the application of the optimizations. Vector length
is the number of elements (32-bit data) of an array accessed in the loop body. Reused data set size
represents the amount of data that must fit in the cache. I/O IRAMs, ALU, FREG and BREG stand for the
number of IRAMs, ALUs, FREGs and BREGs respectively that constitute the XPP. The dataflow graph width
represents the number of operations that can be executed in parallel in the same pipeline stage. The
dataflow graph height represents the length of the pipeline.

The application of each optimization may

decrease a parameter's value (-),
increase a parameter's value (+),
not influence a parameter (id), or
adapt a parameter's value to fit into the goal size (make fit).

Furthermore, some resources must be kept for control in the configuration; this means that the
optimizations should not make the needs exceed more than 70-80% of each resource.
Parameter               Goal                        Starting Value
Vector length           IRAM size (128 words)       Loop count
Reused data set size    Approx. cache size          Algorithm analysis/loop sizes
I/O IRAMs               XPP size (16)               Algorithm inputs + outputs
ALU                     XPP size (< 64)             ALU opcode estimate
BREG                    XPP size (< 80)             BREG opcode estimate
FREG                    XPP size (< 80)             FREG opcode estimate
Dataflow graph width    High                        Algorithm dataflow graph
Dataflow graph height   Small                       Algorithm dataflow graph
Configuration cycles    ≤ command line parameter    Algorithm analysis



Here are some additional notations used in the following descriptions. Let n be the total number of
processing elements available, r the width of the dataflow graph, in the maximum number of input
values in a cycle and out the maximum number of output values possible in a cycle. On the XPP, n is
the number of ALUs, FREGs and BREGs available for a configuration, r is the number of ALUs,
FREGs and BREGs that operate in the same pipeline stage, and in and out amount to
the number of available IRAMs. As IRAMs have 1 input port and 1 output port, the number of IRAMs
yields directly the number of input and output data.

The number of operations of a loop body is computed by adding all logic and arithmetic operations
occurring in the instructions. The number of input values is the number of operands of the instructions
regardless of address operations. The number of output values is the number of output operands of the
instructions regardless of address operations. To determine the number of parallel operations, input
and output values as well as the dataflow graph must be considered. The effects of each transformation
on the architectural parameters are now presented in detail.

Loop Interchange 

Loop interchange is applied when the innermost loop has a very small iteration range. In that case
loop interchange yields an innermost loop with a more profitable iteration range. It is also
influenced by the layout of the data in memory. It can benefit data locality to interchange two loops
so that arrays are accessed in the cache in a more practical order, which prevents cache misses. It is
of course also influenced by data dependences, as explained earlier.
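
A minimal C sketch of loop interchange is given here for illustration. The array shape (ROWS = 4, COLS = 64) and the function names are assumptions chosen for the example; the interchange is legal in this case because the loop body carries no dependence between iterations.

```c
#include <stddef.h>

enum { ROWS = 4, COLS = 64 };

/* Before: the innermost loop has only ROWS = 4 iterations and walks the
   array column-wise, i.e. with a stride of COLS elements. */
void add_bias_before(int a[ROWS][COLS], int bias)
{
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            a[i][j] += bias;
}

/* After interchange: the innermost loop has COLS = 64 iterations and
   touches consecutive memory cells. */
void add_bias_after(int a[ROWS][COLS], int bias)
{
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] += bias;
}
```

In terms of the table that follows, the vector length grows from 4 to 64 while the loop body, and hence the ALU, BREG and FREG needs, stay the same.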
Parameter               Effect
Vector length           +
Reused data set size    make fit
I/O IRAMs               id
ALU                     id
BREG                    id
FREG                    id
Dataflow graph width    id
Dataflow graph height   id
Configuration cycles





Loop Distribution 

Loop distribution is applied if a loop body is too big to fit on the XPP. Its main effect is to reduce
the number of processing elements needed by the loop body. Reducing the number of IRAMs needed is a
side effect of this transformation.



Parameter               Effect
Vector length           id
Reused data set size    id
I/O IRAMs               make fit
ALU                     make fit
BREG                    make fit
FREG                    make fit
Dataflow graph width
Dataflow graph height
Configuration cycles





Loop Collapsing 

Loop collapsing can be used to make the loop body use more memory resources. As several dimensions are
merged, the iteration range is increased and the memory needed is increased as well.

Parameter               Effect
Vector length           +
Reused data set size    +
I/O IRAMs               +
ALU                     id
BREG                    id
FREG                    id
Dataflow graph width    +
Dataflow graph height   +
Configuration cycles    +



Loop Tiling 

Loop tiling, as multi-dimensional strip-mining, is influenced by all parameters. It is especially useful
when the iteration space is by far too big to fit in the IRAM, or to guarantee maximum execution time
when the iteration space is unbounded (see Section 4.4.7). Loop tiling makes the loop body fit with the
resources of the XPP, namely the IRAM and cache line sizes. The size of the tiles for strip-mining and
loop tiling can be computed by

tile size = resources available for the loop body / resources necessary for the loop body

The resources available for the loop body are the whole resources of the XPP for the current
configuration. One tile size may be computed for the data and another one for the processing elements.
The final tile size is the minimum of these two computations. If, for instance, the amount of data
accessed is larger than the capacity of the cache, loop tiling can be applied, as shown by the
following example.

for(i=0; i<=1048576; i++)
    <loop body>

is transformed into

for(i=0; i<=1048576; i+=CACHE_SIZE)
    for(j=0; j<CACHE_SIZE; j+=IRAM_SIZE)
        for(k=0; k<IRAM_SIZE; k++)
            <tiled loop body>

Figure 55: Example of loop tiling for the PACT XPP 
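
The tile size rule stated above (available resources divided by necessary resources, then the minimum over data and processing elements) can be written as a small C helper. The function and parameter names are hypothetical, and the treatment of a zero requirement is an assumption of this sketch.

```c
/* Tile size for one resource class: resources available for the loop body
   divided by the resources one iteration needs. Integer division rounds
   down, so the tile never exceeds the budget. A requirement of zero is
   treated as "no constraint from this resource" (assumption). */
static unsigned tile_for(unsigned available, unsigned needed_per_iteration)
{
    if (needed_per_iteration == 0)
        return available;
    return available / needed_per_iteration;
}

/* Final tile size: the minimum of the tile computed for the data and the
   tile computed for the processing elements. */
unsigned tile_size(unsigned data_words_available, unsigned data_words_per_iter,
                   unsigned pes_available, unsigned pes_per_iter)
{
    unsigned by_data = tile_for(data_words_available, data_words_per_iter);
    unsigned by_pes = tile_for(pes_available, pes_per_iter);
    return by_data < by_pes ? by_data : by_pes;
}
```

For example, with 128 IRAM words and 2 words per iteration the data allows a tile of 64, while 64 processing elements at 4 per iteration allow only 16, so the final tile size is 16.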
Parameter               Effect
Vector length           make fit
Reused data set size    make fit
I/O IRAMs               id
ALU                     id
BREG                    id
FREG                    id
Dataflow graph width    +
Dataflow graph height   +
Configuration cycles    +



Strip-Mining 



Strip-mining is used to make the amount of memory accesses of the innermost loop fit with the IRAM
capacity. Usually the necessary number of processing elements is not the bottleneck, as the
XPP provides 64 ALU-PAEs, which is sufficient to execute most single loop bodies. However, the
number of operations can also be taken into account in the same way as the data.

Parameter               Effect
Vector length           make fit
Reused data set size    id
I/O IRAMs
ALU                     id
BREG                    id
FREG                    id
Dataflow graph width    +
Dataflow graph height   id
Configuration cycles    id



Loop Fusion 

Loop fusion is applied when a loop body does not use enough resources. In this case several loop
bodies are merged to obtain a configuration using a larger part of the available resources.

Parameter               Effect
Vector length           id
Reused data set size    id
I/O IRAMs               +
ALU                     +
BREG                    +
FREG                    +
Dataflow graph width    id
Dataflow graph height   +
Configuration cycles    +



Scalar Replacement 

The amount of memory needed by the loop body should always fit into the IRAMs. With this
optimization, some input or output array data is replaced by scalars that are either stored in FREGs or
kept on buses.
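
As an illustration of scalar replacement, the following C sketch replaces a repeatedly accessed array element by a scalar. The function names and the flattened layout of b are assumptions of the example, and the transformation is only valid here if a and b do not overlap.

```c
#include <stddef.h>

/* Before: a[i] is read and written in every inner iteration. */
void accumulate_before(int *a, const int *b, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            a[i] += b[i * m + j];
}

/* After scalar replacement: a[i] is loaded once into the scalar t, which
   can live in a register or on a bus, and stored once after the loop. */
void accumulate_after(int *a, const int *b, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++) {
        int t = a[i];
        for (size_t j = 0; j < m; j++)
            t += b[i * m + j];
        a[i] = t;
    }
}
```

The inner loop of the second version no longer accesses a at all, which is why the need for I/O IRAMs can drop while register needs may grow.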



Parameter               Effect
Vector length           id
Reused data set size    id
I/O IRAMs
ALU                     id
BREG                    id/+
FREG                    id/+
Dataflow graph width    id/-
Dataflow graph height   id/-
Configuration cycles    id



Loop Unrolling / Loop Collapsing / Loop Fusion 

Loop unrolling, loop collapsing and loop fusion are influenced by the number of operations within the
body of the loop nest and the number of data inputs and outputs of these operations, as they modify the
size of the loop body. The number of operations should always be smaller than n, and the number of
input and output data should always be smaller than in and out. Note that although the number of
configuration cycles increases, the throughput increases as well, resulting in better performance.

Parameter               Effect
Vector length
Reused data set size    id
I/O IRAMs               +
ALU                     +
BREG
FREG                    +
Dataflow graph width    id
Dataflow graph height   +
Configuration cycles    +



Loop Distribution 

The number of operations should always be smaller than n, and the number of input and output data
should always be smaller than in and out. The following table describes the effect for each of the
loops resulting from the loop distribution.

Parameter               Effect
Vector length           id
Reused data set size    id
I/O IRAMs
ALU
BREG
FREG
Dataflow graph width    id
Dataflow graph height
Configuration cycles


Unroll-and-Jam 



access, m ft. ^ loop . me fol , ^ « *m^rr.T S oTm ^ 

the number of operations of the new inner loop must also fit on the PACT XPP.

compute, o y the same formula: . J^^^ 
Although the number of configuration cycles increases, the throughput increases as well, resulting in
better performance.



Parameter               Effect
Vector length           id
Reused data set size    +
I/O IRAMs               +
ALU                     +
BREG                    +
FREG                    +
Dataflow graph width    id
Dataflow graph height   +
Configuration cycles    +



4.4.5 Target Specific Optimizations 

At this step other optimizations, specific to the XPP, may be applied. These optimizations deal mostly
with memory problems and dataflow considerations. This is the case for shift register synthesis, input
data duplication (similar to scalar or array privatization), and loop pipelining.

Shift Register Synthesis 

This optimization deals with array accesses occurring during the execution of a loop body. When
several values of an array are alive for different iterations, it is convenient to store them in registers
rather than accessing memory each time they are needed. As the same value must be stored in different
registers depending on the number of iterations it is alive, a value shares several registers and flows
from one register to another at each iteration. It is similar to a vector register allocated to an array
access with the same value for each element. This optimization is performed directly on the dataflow
graph by inserting nodes representing registers when a value must be stored in a register. In the PACT
XPP, it amounts to storing it in a data register. A detailed explanation can be found in [1].

Shift register synthesis is mainly suitable for small to medium amounts of iterations where values are
alive. Since the pipeline length increases with each iteration for which the value has to be buffered,
the following method is better suited for medium to large distances between accesses in one input
array. Nevertheless this method works very well for image processing algorithms, which mostly alter a
pixel by analyzing itself and its surrounding neighbors. Some resources are needed to produce guards on
the output values to ensure the semantics of the produced code, as all registers must be filled to
allow the code to produce correct values.
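
The effect of shift register synthesis can be illustrated in C with a three-element sliding window. The function names are hypothetical, and the scalars r0, r1 and r2 merely stand in for data registers on the XPP.

```c
#include <stddef.h>

/* Before: every iteration reads three array elements, two of which were
   already read in earlier iterations. "in" must hold n + 2 elements. */
void window_sum_before(int *out, const int *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] + in[i + 1] + in[i + 2];
}

/* After: values flow from register to register at each iteration, so only
   one new element is read per iteration. The two initial reads fill the
   registers before the first valid output can be produced. */
void window_sum_after(int *out, const int *in, size_t n)
{
    int r0 = in[0];
    int r1 = in[1];
    for (size_t i = 0; i < n; i++) {
        int r2 = in[i + 2];   /* the only memory read in the loop */
        out[i] = r0 + r1 + r2;
        r0 = r1;              /* shift the window by one element */
        r1 = r2;
    }
}
```

The fill phase before the loop corresponds to the requirement above that all registers be filled before correct values can be produced.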
Parameter               Effect
Vector length           +
Reused data set size    id
I/O IRAMs               id
ALU                     +
BREG                    id/+
FREG                    +
Dataflow graph width
Dataflow graph height   +
Configuration cycles    +



Input Data Duplication

Parameter               Effect
Vector length           id
Reused data set size    id
I/O IRAMs               +
ALU                     id
BREG                    id
FREG                    id
Dataflow graph width    +
Dataflow graph height
Configuration cycles    id



FIFO pipelining 



This optimization is used to store an array in the memory of the PACT XPP when the array
is smaller than the total amount of memory of the PACT XPP. It
can be used for input or output data. This optimization avoids having to apply
loop tiling/strip-mining to make an array fit on the PACT XPP.
Parameter               Effect
Vector length           id
Reused data set size    id
I/O IRAMs               +
ALU                     id
BREG                    id
FREG                    id
Dataflow graph width    id
Dataflow graph height
Configuration cycles    +



Loop Pipelining 

This optimization synchronizes operations by inserting delays in the dataflow graph. These delays are
registers. For the PACT XPP, it amounts to storing values in data registers to delay the operation using
them. This is the same as pipeline balancing performed by xmap.



Parameter               Effect
Vector length           id
Reused data set size    id
I/O IRAMs               id
ALU                     id
BREG                    +
FREG                    +
Dataflow graph width    +
Dataflow graph height   -/id
Configuration cycles





Tree Balancing 

This optimization consists in balancing the tree representing the loop body. It reduces the depth of the 
pipeline, thus reducing the execution time of an iteration, and increases parallelism. 
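
Tree balancing can be illustrated on a sum of eight operands. The sketch below is an example under stated assumptions: the function names are hypothetical, and the rewrite relies on integer addition being associative (no overflow); for floating-point data, reassociation may change rounding.

```c
/* Unbalanced: a chain of seven dependent additions, i.e. pipeline depth 7. */
int sum8_chain(const int v[8])
{
    return ((((((v[0] + v[1]) + v[2]) + v[3]) + v[4]) + v[5]) + v[6]) + v[7];
}

/* Balanced: the same seven additions arranged as a tree of depth 3.
   Additions on the same tree level are independent of each other. */
int sum8_balanced(const int v[8])
{
    return ((v[0] + v[1]) + (v[2] + v[3])) + ((v[4] + v[5]) + (v[6] + v[7]));
}

/* Depth of a balanced binary reduction over n operands: ceil(log2(n)). */
unsigned balanced_depth(unsigned n)
{
    unsigned depth = 0;
    while ((1u << depth) < n)
        depth++;
    return depth;
}
```

The number of ALUs is unchanged (seven additions in both versions); only the height of the dataflow graph shrinks.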
Parameter               Effect
Vector length           id
Reused data set size    id
I/O IRAMs               id
ALU                     id
BREG                    id
FREG                    id
Dataflow graph width    id
Dataflow graph height
Configuration cycles





4.4.6 Memory Optimizations 

Optimization of Memory Accesses 

Memory accesses are a particular concern for the PACT XPP. They need to be reduced in order to get
enough parallelism to exploit. The loop bodies are freed of unnecessary memory accesses when shift
register synthesis and scalar replacement are applied. Scalar replacement has the same effect as
redundant load/store elimination. Array accesses are taken out of the loop body and handled outside it.
Redundant load/store elimination takes care not only of array accesses but also of accesses to global
variables and records. On the other hand, shift register synthesis removes some accesses completely
from the code.

Access Patterns and Loading of the Data into the IRAMs 

Another issue is how to load data into the IRAMs efficiently in terms of resources consumed and in
terms of execution time. Non-linear access patterns consume a lot of resources to compute the
addresses; moreover, their loading into the IRAMs can then be delayed by cache misses. As described
in section 2.2.3, methods exist to address these problems. They can be applied:

■ on the data layout,

■ on the source code, or

■ on the data transfer.

By modifying the data layout, the access patterns are simplified, thus saving resources and
computation time. This is achieved by array merging, for instance.

The source code itself can be modified to simplify the access patterns. This is the case for matrix
multiplication, presented in the case studies, where a matrix is transposed to obtain line-by-line
access, and for the example presented at the end of this section. On the other hand, loop
tiling allows filling the IRAMs by modifying the iteration range of the innermost loop.

Furthermore the access patterns can be modified by reordering the data. This can happen in two ways,
as already described in section 2.2.5:

■ either by loading the data in the IRAMs in a specific order,

■ or by reordering dynamically the data.

The first data reordering strategy supposes a constant stride between two accesses. If this is not the
case, then the second approach is chosen. More resources are needed, as the flow of data is reordered
by computations done on the PACT XPP to feed the ALU-PAEs, but the data are accessed linearly inside
the IRAMs.

Finally, if none of these methods is applicable, and the access patterns are too costly to be synthesized
on the XPP array, the index expressions are computed in advance and loaded into an IRAM that is
used as an index for accessing the array values stored in another IRAM. For instance, with the
following loop the values {0,0,0,1,1,1,...,7,7,8} are loaded in an IRAM, and will feed the address input
of the IRAM containing array b.

for(i=0;i <= 24;i++) 
a[i] = b[i/3] ; 



In this example, where only one expression causes a problem, another solution is to apply loop tiling to
prune it. The resulting loops are shown below. In the tiled loop, the expression i/3 evaluates to 0, as i
is always smaller than 3. This is found by the value range analysis. The access pattern can then be
synthesized on the XPP array to access the array values in the IRAMs.



for(j=0; j<=7; j++)
    for(i=0; i<3; i++) {
        a[i+3*j] = b[i/3+j];
    }

for(j=0; j<=7; j++)
    for(i=0; i<3; i++) {
        a[i+3*j] = b[j];
    }
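
The following C sketch checks the simplification above against the original loop and also builds the index stream that would be preloaded into an IRAM. Function names are hypothetical. Note that the 8 x 3 tiles cover only i = 0..23, so this sketch peels the last iteration (i = 24) explicitly; that peeled statement is an assumption and is not part of the listings above.

```c
/* Original loop from the text: a[i] = b[i/3] for i = 0..24. */
void gather_original(int *a, const int *b)
{
    for (int i = 0; i <= 24; i++)
        a[i] = b[i / 3];
}

/* Tiled and simplified form. Inside a tile i < 3 holds, so i/3 is 0 and
   the index reduces to j. The iteration i = 24 is peeled off (assumption
   of this sketch, see the lead-in). */
void gather_tiled(int *a, const int *b)
{
    for (int j = 0; j <= 7; j++)
        for (int i = 0; i < 3; i++)
            a[i + 3 * j] = b[j];
    a[24] = b[8];
}

/* Alternative: precompute the index expressions, as would be done before
   loading them into an IRAM that feeds the address input of b's IRAM. */
void index_stream(int *idx)
{
    for (int i = 0; i <= 24; i++)
        idx[i] = i / 3;
}
```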



4.4.7 Limiting the Execution Time of a Configuration 

The execution time of a configuration must be controlled. This is ensured in the compiler by strip-
mining and loop tiling, which also take care that the input data does not exceed the IRAMs' capacity.
This way the iteration range of the loop that is executed on the XPP is limited, and therefore its
execution time. Moreover, partitioning ensures that only loops whose execution count can be computed at
run time are going to be executed on the XPP. This condition is trivial for for-loops, but for
while-loops, where the execution count cannot be determined statically, a transformation like the one
sketched below is applied. As a result, the inner for-loop can be handled by the XPP.

while (ok) {
    <loop body>
}

is transformed into

while (ok)
    for(i=0; i<100 && ok; i++) {
        <loop body>
    }

Figure 56: Transformation of while-loops 
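
A runnable C instance of this while-loop transformation is sketched below. The loop body (a Collatz iteration) is only an arbitrary example of a loop whose trip count is not known statically, and the function names are hypothetical; the chunk size of 100 follows Figure 56. Both functions assume x >= 1.

```c
#define CHUNK 100   /* upper bound on the iterations of one inner loop run */

/* Original form: trip count unknown before the loop runs. */
unsigned steps_while(unsigned long x)
{
    unsigned steps = 0;
    int ok = (x != 1);
    while (ok) {
        x = (x & 1) ? 3 * x + 1 : x / 2;
        steps++;
        ok = (x != 1);
    }
    return steps;
}

/* Transformed form: the inner for-loop runs at most CHUNK iterations, so
   its execution time is bounded and it can be mapped to one configuration. */
unsigned steps_chunked(unsigned long x)
{
    unsigned steps = 0;
    int ok = (x != 1);
    while (ok)
        for (int i = 0; i < CHUNK && ok; i++) {
            x = (x & 1) ? 3 * x + 1 : x / 2;
            steps++;
            ok = (x != 1);
        }
    return steps;
}
```

For a start value that needs more than 100 steps, the outer while-loop simply launches the bounded inner loop a second time.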
5 Case Studies



5.1 Introduction 



This chapter presents six case studies from fields where a RISC-XPP combination fits best.
As typical DSP examples, a finite impulse response (FIR) filter and a Viterbi decoder are investigated.
Image processing algorithms are represented by an edge detector function, the inverse discrete cosine
transformation from an MPEG codec and a wavelet transformation. Furthermore a matrix
multiplication and the quantization functions of the MPEG codec are investigated.



The examples are transformed with the various optimizations presented in the preceding chapters. The
result of each transformation is presented in C code, which is sometimes shortened for better
understanding. In a last step the code is split into C code which runs on the RISC host and C code
which runs on the XPP. The configuration is presented as a dataflow graph

SS^c?352^ a m,d "~ since sorae features of 4116 " e 



5.2 Conventions 

5.2.1 Configuration and IRAM names 

Configurations are named by a prefix __XppCfg_ and a name. They are defined as C functions without
parameters and without a return value.

The communication with the rest of the system is done over the IRAMs exclusively. In the C
representation of a configuration, the IRAMs are declared in one of two ways:

- For FIFO access. IRAMs in this mode are read and written sequentially.
No address generators are needed. The access is illustrated by using the post increment notation
iram<N>++. When the declaration is of a smaller data type than integer, this silently implies
that converters to 32 bits are produced by the compiler.

- As arrays of type (unsigned) char[512], short[256], or int[128], respectively. The access notation
in C is then iram<N>[offset expression]. In contrast to FIFO access, dedicated address
generators must be synthesized. As mentioned above, the usage of data types smaller than integer
implies automatically generated data type converters.

All code parts outside a __XppCfg_-prefixed function are meant to run on the RISC host. The RISC
5.2.2 Endianness

We assume big endian data layout. This means that the string representation of the word "PACT XPP" 
loaded to an IRAM causes the following IRAM content. 



Address   Content
0x00      0x50414354 ('P' << 24 | 'A' << 16 | 'C' << 8 | 'T')
0x01      0x20585050 (' ' << 24 | 'X' << 16 | 'P' << 8 | 'P')



Similarly, loading an array of 4 16-bit (short) values with the values 0x1234, 0x5678, 0x9abc and
0xdef0 respectively, causes the following content.



Address   Content
0x00      0x12345678
0x01      0x9abcdef0



There is no special reason for this choice; little endian order would be possible, too. Of course the
predefined modules in the next section must then be adapted to the changed data layout.
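
The big endian IRAM layout assumed above can be reproduced with the following C sketch. The function names are hypothetical; the code only models how bytes and 16-bit values are packed into 32-bit IRAM words, and the character codes assume ASCII.

```c
#include <stddef.h>
#include <stdint.h>

/* Packs bytes into 32-bit words, first byte in bits 31..24.
   n_bytes is expected to be a multiple of 4. */
void pack_bytes_be(uint32_t *iram, const unsigned char *bytes, size_t n_bytes)
{
    for (size_t w = 0; w < n_bytes / 4; w++)
        iram[w] = (uint32_t)bytes[4 * w] << 24
                | (uint32_t)bytes[4 * w + 1] << 16
                | (uint32_t)bytes[4 * w + 2] << 8
                | (uint32_t)bytes[4 * w + 3];
}

/* Packs 16-bit values into 32-bit words, first value in bits 31..16.
   n_values is expected to be a multiple of 2. */
void pack_shorts_be(uint32_t *iram, const uint16_t *values, size_t n_values)
{
    for (size_t w = 0; w < n_values / 2; w++)
        iram[w] = (uint32_t)values[2 * w] << 16
                | (uint32_t)values[2 * w + 1];
}
```

Packing the 8 characters of "PACT XPP" and the four short values from the text yields exactly the word contents listed in the two tables.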

5.2.3 Predefined Modules 

For better readability of the examples some predefined modules are used. In the following subsections
they are briefly described and their dataflow graphs are given.

Up counters

The counters are used on the one hand to drive the IRAM reads and writes and, on the other hand, to
generate event sequences for the conversion modules presented next. The different implementations
are described in [12] in detail.

Conversion Modules 

Predefined conversion modules are used throughout the case studies. The compiler handles them as 
compiler known functions. The compiler either generates conversion modules which produce a 
sequential stream of converted values, or it generates modules which simply split packets into parallel 
streams which then can be processed concurrently. Figure 57 shows the implementations of the 
converters which convert to one stream. They output one 8/16-bit value per cycle. The input 
connectors expect data packets with packed values of the shorter data type. Furthermore the selector 
inputs need special event sequences for correct operations. 

The second type of converters, which can only be used if dependences allow it, simply split a data 
packet in 2 or 4 streams with boolean operations, and do a sign extension if necessary. Since the 
implementations are straightforward, the dataflow graphs are omitted. 
Figure 57: Modules for conversion from and to shorter data types. The selector inputs are driven by
event streams such as '101010...' and '00010001...'. All modules output one packet per cycle.



5.3 Performance Evaluation Procedure 



5.3.1 Target Hardware Platform 
Unit                   Frequency   Remarks
RISC core              400 MHz
XPP Cache Controller   400 MHz     1 preload FIFO stage
XPP PAE Array          200 MHz     8x8 ALU PAEs, 16 IRAM ports, 4 I/O ports

Storage   Frequency   Size     Remarks
ICache    400 MHz     64 KB    fully associative, cache line 32 bytes
DCache    400 MHz     128 KB   fully associative, cache line 32 bytes,
                               write-back / write allocation
IRAMs     400 MHz     32 KB    16 ports x 4 shadows x 128 ints x 32 bits

Bus              Frequency   Bus width   Max. throughput   Remarks
ICache - PAE     400 MHz     32 bit      1600 MB/s
DCache - IRAMs   400 MHz     128 bit     6400 MB/s
SDRAM            100 MHz     32 bit      400 MB/s          Read burst: 7-1-1-1-1-1-1-1
                                                           Write burst: 1-1-1-1-1-1-1-1



As a simplification, alignment is not considered: we assume one cache miss for every thirty-two bytes
when reading successive memory cells. This is justified because at most a single cache miss is omitted,
namely the one that occurs if the array spans one more cache line due to misalignment.
Execution times, in 400 MHz cycles:

Resource                                       t(data size [bits]) [400 MHz cycles]

ICache Hit: ICache -> PAE Array                ceil(data size / 32)

DCache Hit: DCache -> IRAM or IRAM -> DCache   ceil(data size / 128)

Cache Read Miss: RAM -> Cache                  roundUp(data size, 256) / (8*32 / ((7 + 7*1) * 4))
                                               = ceil(data size * 56 / 256)

Cache Write-Back: Cache -> RAM                 roundUp(data size, 256) / (8*32 / ((8*1) * 4))
                                               = ceil(data size * 32 / 256)

Cache Write Miss (IRAM -> RAM):                = ceil(data size * 56 / 256) + ceil(data size / 128)
Cache Read Miss + Write Transfer
(IRAM -> Cache)

Execution: PAE Array                           Configuration execution cycles * 2
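The table entries can be written down as small helper functions. The following is only a sketch: the function and constant names are hypothetical, and the miss and write-back costs are charged per whole 256-bit cache line (the roundUp form of the table), which coincides with the ceil form whenever the data size is a multiple of a cache line.

```c
#include <stdint.h>

/* Integer ceiling division; b must be non-zero. */
static uint64_t ceil_div(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

enum {
    ICACHE_BUS_BITS   = 32,              /* ICache - PAE bus width            */
    DCACHE_BUS_BITS   = 128,             /* DCache - IRAM bus width           */
    CACHE_LINE_BITS   = 256,             /* 32-byte cache line                */
    READ_LINE_CYCLES  = (7 + 7 * 1) * 4, /* read burst 7-1-1-1-1-1-1-1 -> 56  */
    WRITE_LINE_CYCLES = (8 * 1) * 4      /* write burst 1-1-1-1-1-1-1-1 -> 32 */
};

/* ICache hit: ICache -> PAE array. */
uint64_t icache_hit_cycles(uint64_t bits) { return ceil_div(bits, ICACHE_BUS_BITS); }

/* DCache hit: DCache -> IRAM or IRAM -> DCache. */
uint64_t dcache_hit_cycles(uint64_t bits) { return ceil_div(bits, DCACHE_BUS_BITS); }

/* Cache read miss: RAM -> cache, one read burst per touched cache line. */
uint64_t cache_read_miss_cycles(uint64_t bits)
{
    return ceil_div(bits, CACHE_LINE_BITS) * READ_LINE_CYCLES;
}

/* Cache write-back: cache -> RAM, one write burst per touched cache line. */
uint64_t cache_write_back_cycles(uint64_t bits)
{
    return ceil_div(bits, CACHE_LINE_BITS) * WRITE_LINE_CYCLES;
}

/* Cache write miss: read miss (write allocation) plus IRAM -> cache transfer. */
uint64_t cache_write_miss_cycles(uint64_t bits)
{
    return cache_read_miss_cycles(bits) + dcache_hit_cycles(bits);
}

/* PAE array execution, scaled from 200 MHz to 400 MHz cycles. */
uint64_t pae_execution_cycles(uint64_t config_cycles) { return 2 * config_cycles; }
```

For a picture row of 16 ints (512 bits), this gives 112 cycles for the read miss and 4 cycles for the DCache to IRAM transfer.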

Whenever there are no pipeline stalls, the different units and busses can work in parallel. Thus the total
execution time is defined by the following formula, where RAM transfer cycles summarizes the cycles
of the cache read misses and the cache write-back cycles:

max( Sum (Execution cycles), 

Sum (RAM transfer cycles), 

max (Sum(ICache transfer cycles), 

Sum(DCache transfer cycles)) ) [cycles @ 400 MHz] 

If there are pipeline stalls, the outer maximum is replaced by a sum, reflecting the fact that the units
have to wait for each other to finish.
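The two cases can be sketched in C as follows. The struct and function names are hypothetical, and the four fields are assumed to hold the already summed cycle counts of the formula above.

```c
#include <stdint.h>

/* Summed cycle counts (at 400 MHz) of one evaluated loop or function. */
typedef struct {
    uint64_t execution;        /* Sum(Execution cycles)       */
    uint64_t ram_transfer;     /* Sum(RAM transfer cycles)    */
    uint64_t icache_transfer;  /* Sum(ICache transfer cycles) */
    uint64_t dcache_transfer;  /* Sum(DCache transfer cycles) */
} cycle_sums_t;

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* No pipeline stalls: all units overlap, so the slowest one determines the time. */
uint64_t total_cycles_no_stalls(const cycle_sums_t *s)
{
    uint64_t cache = max_u64(s->icache_transfer, s->dcache_transfer);
    return max_u64(max_u64(s->execution, s->ram_transfer), cache);
}

/* Pipeline stalls: the outer maximum becomes a sum, the inner maximum stays. */
uint64_t total_cycles_with_stalls(const cycle_sums_t *s)
{
    return s->execution + s->ram_transfer
         + max_u64(s->icache_transfer, s->dcache_transfer);
}
```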

Only the amount of data that actually has to be transferred is considered. Data that is already in a
cache or in the IRAMs is not accounted for.

For the startup case, the caches are assumed to be empty. Only the read data is considered, as the
write-backs of the first iteration will take place in the next iteration. Due to the dependences, the above
formula changes to a sum over all configurations of the following per-configuration term, where n is
the number of IRAM preloads of the configuration:

ICache read miss +

max ( ICache transfer cycles,
      Data cache read miss_1 +
      Sum_{j=2..n}( max (Data cache read miss_j, DCache transfer_{j-1}) ) +
      DCache transfer_n ) +

Execution cycles [cycles @ 400 MHz]

This double sum converges to the previous formula for any non-trivial number of IRAM preloads.
Also, the RAM cycles dominate the transfer cycles by an order of magnitude. Therefore this more
complicated computation method is only used for the trivial cases.
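A sketch of the per-configuration startup term is given below. It assumes one particular reading of the formula: the read miss of preload j overlaps with the DCache to IRAM transfer of preload j-1, the first read miss and the last transfer are not overlapped. The function name and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/*
 * Startup cycles of one configuration with n IRAM preloads.
 * dmiss[j]     : data cache read miss cycles of preload j (0-based)
 * dtransfer[j] : DCache -> IRAM transfer cycles of preload j (0-based)
 */
uint64_t startup_config_cycles(uint64_t icache_read_miss,
                               uint64_t icache_transfer,
                               const uint64_t *dmiss,
                               const uint64_t *dtransfer,
                               size_t n,
                               uint64_t execution)
{
    uint64_t data = 0;
    if (n > 0) {
        data = dmiss[0];                              /* first miss is exposed    */
        for (size_t j = 1; j < n; ++j)                /* miss j overlaps xfer j-1 */
            data += max_u64(dmiss[j], dtransfer[j - 1]);
        data += dtransfer[n - 1];                     /* last transfer is exposed */
    }
    return icache_read_miss + max_u64(icache_transfer, data) + execution;
}
```

Because each read miss costs far more than an IRAM transfer, the data term is close to the plain sum of the read misses, which is why the simpler formula suffices for non-trivial cases.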

For the average case, only data that are read for the first time are accounted for. The average case is
defined as the iteration after an infinite number of iterations: all data that can be reused from the
previous iteration are in the cache. All data that are used for the first time must be fetched from RAM.
All data that are redefined by the next iteration have to be written back to the cache and the RAM.

The use of the XppPreloadClean instruction is a special case: no write allocation takes place, except at
the start and the end of the array, if it is not aligned to a cache line boundary. These burst transfers are
neglected. Also no read transfer from the cache to the IRAM takes place.

5.3.2 Evaluation Procedure 

As mentioned above, all examples are transformed with various transformations and intermediate
results are presented in C code on a regular basis. Wherever possible, valid C code is presented.
Nevertheless, in some examples it is necessary to use features which are not expressible in the source
language. These then appear in comments within the source code.

After the partition step, configurations are hand written in NML to simulate the compiler code
generation step. Placement and routing is done automatically by the mapping tool XMAP. For
convenience the NML feature to define modules is used. In some cases, the objects in the critical path
are placed manually close to each other, as this has proven to improve the execution performance.

Each example lists the estimated data transfer performance in a table like the one below. The estimation
assumes a cache controller which works at the RISC frequency, which is twice the frequency of the
XPP array and four times the frequency of the 32-bit main memory bus. The Cache-IRAM transfers
are executed at full cache controller speed over a 128-bit bus. All values are scaled to the cache
controller frequency. The table below shows a typical data transfer estimation.

[Table: typical data transfer estimation. Legible entries: 4*14 cache cycles per cache read miss;
4*14 cycles penalty per cache write miss (write allocation).]



A cache read miss causes a 14 cycles penalty for the burst transfer on the main memory bus, which
calculates to 4*14=56 cache cycles to load a 32 byte cache line from main memory. If a write miss
occurs, the cache controller write allocation must first load the affected cache line before it can be
altered and written back. By using XppPreloadClean, write misses can be avoided. Then only the
cache-RAM transfer with a 32-bit word every 4 cache cycles must be accounted for. For this reason
some examples show a smaller number of write-back cache misses than expected.

The XPP execute cycles are calculated by taking the doubled cycle difference (scaling to the cache
frequency) between the end of the configuration execution and the start of the configuration execution.

The NML sources are implemented so that configuration loading and configuration execution do not
overlap. This is done by means of a start object which is configured last and creates an event to start
execution. The cycle measurements for the XPP only include the code which is executed in the
configurations, i.e. in the loops of the evaluated function. The remaining control code, i.e. if
statements, is not included. It is possible to neglect this remaining code on the RISC processor since
this code is executed in parallel to the XPP and is significantly shorter.

On the reference system, this code is executed in sequence to the code of the configurations, so it
cannot be neglected. Moreover, splitting the code for the reference system into many small units
prevents many optimizations for that system, making the measurements unrealistic. Thus the complete
loop is timed on the reference system for those case studies that suffer most from these effects.

The performance data of the reference system were measured by using a production compiler for a 32
bit fixed point DSP with a maximum instruction issue of four, an average instruction issue of
approximately [...] and a one cycle memory access to on-chip high speed RAM. This allows to simply
add the data cache miss cycles to the measured execution time to obtain realistic execution times for a
memory hierarchy and off-chip RAM. Since the DSP cannot handle 8-bit data types reasonably, the
sources were adapted to work with short, int and long types only, to get representative results.

The results are summarized in another table. An example is shown below. All values are converted to
the highest frequency (cache / RISC cycles). For each configuration the data transfer cycles are listed
for RAM and cache accesses. Then the execution cycles are given for the XPP and for the reference
system. Finally the speedup is printed as reference execution cycles / XPP execution cycles. Using the
formulas of section 5.3.1, execution cycles and speedups are given for all three different possibilities
where the data can be located initially: in the IRAMs (column core; for the XPP only, for the RISC the
in-cache column is used instead), in the cache, or in RAM.

In the example performance evaluation table below, the first three rows list the performance data of
each configuration separately, and the last row lists the performance data of all configurations of the
function. The data transfer cycles for the separate configurations represent the cache misses and
write-backs which would be necessary for executing the configuration alone. The number of cycles for
executing all configurations is less than the sum of the cycles for the separate configurations, since
data can remain in the IRAMs or in the cache between configurations.

Usually the configurations are executed in a loop. Therefore the first table describes the first iteration
of the example loop. None of the configurations is in the cache, and neither are the required input
data. No outputs have been computed so far, so no write-backs take place.

[Example table, first iteration. Rows: configuration1, configuration2, configuration3, all cfgs.
Columns: data access cycles (RAM, cache), XPP execution cycles, reference execution cycles and
speedup, each for data located in the core, the cache or the RAM. The cell values are not legibly
recoverable.]



In the second table, the average case is described: all configurations are cached in the XPP array, as
are all data that can be reused from the previous iteration.

[Example table, average case. Rows: configuration1, configuration2, configuration3, all cfgs.
Columns as in the first table. The cell values are not legibly recoverable.]



This is repeated for all loops in the example. For some examples, no outer loop exists. In this case the
suboptimal linear case is described, as well as the case that the given function is called within a loop.
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5.4 3x3 Edge Detector 

5.4.1 Original Code 

#include <stdio.h>

#define VERLEN 16
#define HORLEN 16

int main(void) {
    int v, h;
    int p1[VERLEN][HORLEN];
    int p2[VERLEN][HORLEN];
    int htmp, vtmp, sum;

    for (v = 0; v < VERLEN; v++)              // loop nest 1
        for (h = 0; h < HORLEN; h++) {
            scanf("%d", &p1[v][h]);           // read input pixels to p1
            p2[v][h] = 0;                     // initialize p2
        }

    for (v = 0; v <= VERLEN-3; v++) {         // loop nest 2
        for (h = 0; h <= HORLEN-3; h++) {
            htmp = (p1[v+2][h]   - p1[v][h])   +
                   (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
            if (htmp < 0)
                htmp = -htmp;

            vtmp = (p1[v][h+2]   - p1[v][h])   +
                   (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
            if (vtmp < 0)
                vtmp = -vtmp;

            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            p2[v+1][h+1] = sum;
        }
    }

    for (v = 0; v < VERLEN; v++)              // loop nest 3
        for (h = 0; h < HORLEN; h++)
            printf("%d\n", p2[v][h]);         // print output pixels from p2

    return 0;
}

5.4.2 Preliminary Transformations 

Due to the calls to the library functions scanf and printf in loop nest one and loop nest three,
respectively, only loop nest two is handled in the further sections.

Interprocedural Optimizations

The first step normally invokes interprocedural transformations like function inlining and loop
pushing. Since no procedure calls are within the loop body, these transformations are not applied to
this example.
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Basic Transformations 

The following transformations are done:

■ Idiom recognition finds the abs() and min() structures in the loop body and reduces them to
compiler known functions.

■ Tree balancing reduces the tree depth by swapping the operands of the additions.

■ The array accesses are mapped to IRAM accesses.

■ Since this example uses different values of one IRAM within an iteration, either shift register
synthesis or data duplication must be used. To show the difference between these two
transformations, both are outlined here.

The resulting code after this step is shown below. First with shift register synthesis: 

for(v=0; v<=VERLEN-3; v++) {
    int iram0[128];  // p1[v]
    int iram1[128];  // p1[v+1]
    int iram2[128];  // p1[v+2]
    int iram3[128];  // p2[v+1][1]

    for(h=0; h<=HORLEN-1; h++) {
        // fill shift registers
        if (h>1) { tmp00 = tmp01; tmp10 = tmp11; tmp20 = tmp21; }
        if (h>0) { tmp01 = tmp02; tmp11 = tmp12; tmp21 = tmp22; }
        tmp02 = iram0[h]; tmp12 = iram1[h]; tmp22 = iram2[h];
        if (h>2) {
            htmp = 2 * (tmp21 - tmp01) +
                       (tmp20 - tmp00) +
                       (tmp22 - tmp02);
            htmp = abs(htmp);
            vtmp = 2 * (tmp12 - tmp10) +
                       (tmp02 - tmp00) +
                       (tmp22 - tmp20);
            vtmp = abs(vtmp);
            sum = min(255, htmp + vtmp);
            iram3[h-1] = sum;
        }
    }
}
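The equivalence of the shift register form with loop nest two of the original code can be checked with a small self-contained program fragment. This is a sketch under stated assumptions: it works directly on the picture arrays instead of IRAMs, the function names are hypothetical, and it starts emitting results at h >= 2, the first column at which all three stages of the shift registers hold valid pixels.

```c
#include <stdlib.h>

#define VERLEN 16
#define HORLEN 16

static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference: loop nest 2 of the original code. */
void edge_reference(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN])
{
    for (int v = 0; v <= VERLEN - 3; v++)
        for (int h = 0; h <= HORLEN - 3; h++) {
            int htmp = (p1[v+2][h]   - p1[v][h])   +
                       (p1[v+2][h+2] - p1[v][h+2]) +
                   2 * (p1[v+2][h+1] - p1[v][h+1]);
            int vtmp = (p1[v][h+2]   - p1[v][h])   +
                       (p1[v+2][h+2] - p1[v+2][h]) +
                   2 * (p1[v+1][h+2] - p1[v+1][h]);
            p2[v+1][h+1] = min_int(255, abs(htmp) + abs(vtmp));
        }
}

/* Shift register form: every input row is read exactly once per column. */
void edge_shift_register(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN])
{
    for (int v = 0; v <= VERLEN - 3; v++) {
        /* tRC holds the pixel of row v+R at column offset C of the 3x3 window */
        int t00 = 0, t01 = 0, t02 = 0;
        int t10 = 0, t11 = 0, t12 = 0;
        int t20 = 0, t21 = 0, t22 = 0;

        for (int h = 0; h <= HORLEN - 1; h++) {
            /* shift the window one column to the right */
            t00 = t01; t10 = t11; t20 = t21;
            t01 = t02; t11 = t12; t21 = t22;
            t02 = p1[v][h]; t12 = p1[v+1][h]; t22 = p1[v+2][h];

            if (h >= 2) {   /* window is completely filled */
                int htmp = 2 * (t21 - t01) + (t20 - t00) + (t22 - t02);
                int vtmp = 2 * (t12 - t10) + (t02 - t00) + (t22 - t20);
                p2[v+1][h-1] = min_int(255, abs(htmp) + abs(vtmp));
            }
        }
    }
}
```

Running both functions on the same input picture and comparing the two output arrays element by element shows whether the register shifting preserves the 3x3 window contents.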



And with data duplication: 



for(v=0; v<=VERLEN-3; v++) {
    int iram0[128], iram1[128], iram2[128];  // p1[v]
    int iram3[128], iram4[128];              // p1[v+1]
    int iram5[128], iram6[128], iram7[128];  // p1[v+2]
    int iram8[128];                          // p2[v+1][1]

    for(h=0; h<=HORLEN-3; h++) {
        tmp00 = iram0[h];   tmp10 = iram3[h];   tmp20 = iram5[h];
        tmp01 = iram1[h+1];                     tmp21 = iram6[h+1];
        tmp02 = iram2[h+2]; tmp12 = iram4[h+2]; tmp22 = iram7[h+2];

        htmp = 2 * (tmp21 - tmp01) +
                   (tmp20 - tmp00) +
                   (tmp22 - tmp02);
        htmp = abs(htmp);
        vtmp = 2 * (tmp12 - tmp10) +
                   (tmp02 - tmp00) +
                   (tmp22 - tmp20);
        vtmp = abs(vtmp);
        sum = min(255, htmp + vtmp);
        iram8[h] = sum;
    }
}

The following table shows the estimated utilization and performance values.

Parameter               Value (shift register synthesis)         Value (data duplication)

Vector length           16                                       16
Reused data set size    32                                       32
I/O IRAMs               3 I + 1 O = 4                            8 I + 1 O = 9
ALU                     8 (calc) + 3*2 (compare for shift        8 (calc)
                        register synthesis) = 14
BREG                    10 (BREG_SUB/BREG_ADD)                   10 (BREG_SUB/BREG_ADD)
FREG                    3*2 = 6 (shift register synthesis)       few
Dataflow graph width    12                                       12
Dataflow graph height   3 (shift registers) + 8 (calculation)    8 (calculation)
Configuration cycles    11+16=27                                 8+16=24

The inner loop calculation dataflow graph is shown in Figure 58. The inputs are either connected over
the shift register network shown in Figure 59, or directly to an own IRAM.

5.4.3 Enhancing Parallelism 

The table above shows a utilization of about one fourth of the ALUs. Until now we neglected that the
SUB and ADD operations can be done by BREGs as well. Therefore we try to maximize utilization.

Unroll-and-Jam 

Unroll-and-jam is the transformation of choice, because of its nature to bring iterations together. As
the reused data size increases, the IRAM usage does not increase proportionally to the unrolling factor.

The parameters which determine the unrolling factor are the overall loop count of 14, the IRAM
utilization of 4 and 9, respectively, and the PAE counts. The first parameter allows an unrolling degree
for unroll-and-jam equal to 2 and 7, while the IRAMs restrict it to 7 and 2, respectively. The PAE
usage would allow an unrolling degree equal to 4 (ALU ADD/SUB replaced by BREG ADD/SUB).
Therefore the minimum of all factors must be taken, which is 2. The estimated values are shown in the
table below.
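The selection of the unrolling degree can be illustrated by a short C function. It is a sketch, not the compiler's actual heuristic: it assumes that the degree must divide the loop count and must not exceed the limits imposed by the IRAMs and the PAEs, and it returns the largest such degree. The function name is hypothetical.

```c
/*
 * Largest unroll-and-jam degree that divides trip_count and respects both
 * resource limits. Returns 1 (no unrolling) if no larger degree qualifies.
 */
int unroll_and_jam_degree(int trip_count, int iram_limit, int pae_limit)
{
    int limit = iram_limit < pae_limit ? iram_limit : pae_limit;
    for (int u = limit; u > 1; --u)
        if (trip_count % u == 0)
            return u;
    return 1;
}
```

With the values from the text (loop count 14, PAE limit 4), the IRAM limit of 7 for shift register synthesis and of 2 for data duplication both lead to a degree of 2.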



Parameter               Value (shift register synthesis)         Value (data duplication)

Vector length           2*16                                     2*16
Reused data set size    48                                       48
I/O IRAMs               4 I + 2 O = 6                            12 I + 2 O = 14
ALU                     2*8 + 4*2 = 24                           2*8 = 16
BREG                    20                                       20
FREG                    4*2 = 8                                  few
Dataflow graph width    12                                       12
Dataflow graph height   3 (shift registers) + 8 (calculation)    8 (calculation)
Configuration cycles    11+16=27 (two outputs/configuration)     8+16=24 (two outputs/configuration)



Figure 58 The main calculation network of the edge3x3 configuration. The MUL/SORT combination does the
abs() calculation while the SORT does the min() calculation.
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5.4.4 Final Code 



Shift Register Synthesis 

The RISC code for shift register synthesis after unroll-and-jam reads then: 

XppPreloadConfig(XppCfg_edge3x3);

for(v=0; v<=VERLEN-3; v+=2) {
    XppPreload(0, &p1[v], 16);
    XppPreload(1, &p1[v+1], 16);
    XppPreload(2, &p1[v+2], 16);
    XppPreload(3, &p1[v+3], 16);
    XppPreloadClean(4, &p2[v+1][1], 14);
    XppPreloadClean(5, &p2[v+2][1], 14);
    XppExecute();
}

The configuration reads as follows: 

void XppCfg_edge3x3() {
    // IRAMs
    int iram0[128];  // p1[v]
    int iram1[128];  // p1[v+1]
    int iram2[128];  // p1[v+2]
    int iram3[128];  // p1[v+3]
    int iram4[128];  // p2[v+1][1]
    int iram5[128];  // p2[v+2][1]

    for(h=0; h<=HORLEN-1; h++) {
        // fill shift registers
        if (h>1) { tmp00 = tmp01; tmp10 = tmp11; tmp20 = tmp21;
                   tmp30 = tmp31; }
        if (h>0) { tmp01 = tmp02; tmp11 = tmp12; tmp21 = tmp22;
                   tmp31 = tmp32; }
        tmp02 = iram0[h]; tmp12 = iram1[h]; tmp22 = iram2[h];
        tmp32 = iram3[h];
        if (h>2) {
            htmp0 = 2 * (tmp21 - tmp01) +
                        (tmp20 - tmp00) +
                        (tmp22 - tmp02);
            htmp0 = abs(htmp0);
            vtmp0 = 2 * (tmp12 - tmp10) +
                        (tmp02 - tmp00) +
                        (tmp22 - tmp20);
            vtmp0 = abs(vtmp0);
            sum0 = min(255, htmp0 + vtmp0);
            iram4[h-1] = sum0;

            htmp1 = 2 * (tmp31 - tmp11) +
                        (tmp30 - tmp10) +
                        (tmp32 - tmp12);
            htmp1 = abs(htmp1);
            vtmp1 = 2 * (tmp22 - tmp20) +
                        (tmp12 - tmp10) +
                        (tmp32 - tmp30);
            vtmp1 = abs(vtmp1);
            sum1 = min(255, htmp1 + vtmp1);
            iram5[h-1] = sum1;
        }
    }
}

Figure 59 Input preparation with shift register synthesis. For each IRAM access one of these modules
is generated.



Data Duplication 

Data duplication needs more preloads. 

XppPreloadConfig(XppCfg_edge3x3);

for(v=0; v<=VERLEN-3; v+=2) {
    XppPreload(0, &p1[v], 16);
    XppPreload(1, &p1[v], 16);
    XppPreload(2, &p1[v], 16);
    XppPreload(3, &p1[v+1], 16);
    XppPreload(4, &p1[v+1], 16);
    XppPreload(5, &p1[v+1], 16);
    XppPreload(6, &p1[v+2], 16);
    XppPreload(7, &p1[v+2], 16);
    XppPreload(8, &p1[v+2], 16);
    XppPreload(9, &p1[v+3], 16);
    XppPreload(10, &p1[v+3], 16);
    XppPreload(11, &p1[v+3], 16);
    XppPreloadClean(12, &p2[v+1][1], 14);
    XppPreloadClean(13, &p2[v+2][1], 14);
    XppExecute();
}



On the other hand the configuration is less complex. 



void XppCfg_edge3x3() {
    // IRAMs
    int iram0[128], iram1[128],  iram2[128];   // p1[v]
    int iram3[128], iram4[128],  iram5[128];   // p1[v+1]
    int iram6[128], iram7[128],  iram8[128];   // p1[v+2]
    int iram9[128], iram10[128], iram11[128];  // p1[v+3]
    int iram12[128];                           // p2[v+1][1]
    int iram13[128];                           // p2[v+2][1]

    for(h=0; h<=HORLEN-3; h++) {
        tmp00 = iram0[h];   tmp10 = iram3[h];
        tmp20 = iram6[h];   tmp30 = iram9[h];
        tmp01 = iram1[h+1]; tmp11 = iram4[h+1];
        tmp21 = iram7[h+1]; tmp31 = iram10[h+1];
        tmp02 = iram2[h+2]; tmp12 = iram5[h+2];
        tmp22 = iram8[h+2]; tmp32 = iram11[h+2];

        htmp0 = 2 * (tmp21 - tmp01) +
                    (tmp20 - tmp00) +
                    (tmp22 - tmp02);
        htmp0 = abs(htmp0);
        vtmp0 = 2 * (tmp12 - tmp10) +
                    (tmp02 - tmp00) +
                    (tmp22 - tmp20);
        vtmp0 = abs(vtmp0);
        sum0 = min(255, htmp0 + vtmp0);
        iram12[h] = sum0;

        htmp1 = 2 * (tmp31 - tmp11) +
                    (tmp30 - tmp10) +
                    (tmp32 - tmp12);
        htmp1 = abs(htmp1);
        vtmp1 = 2 * (tmp22 - tmp20) +
                    (tmp12 - tmp10) +
                    (tmp32 - tmp30);
        vtmp1 = abs(vtmp1);
        sum1 = min(255, htmp1 + vtmp1);
        iram13[h] = sum1;
    }
}



5.4.5 Performance Evaluation 

The next tables list the estimated performance of the data transfers. The values consider the data
reuse, which means that after the startup, which preloads 4 picture rows, each iteration only advances
two picture rows. Therefore two rows are reused and stay in the cache.

Shift register synthesis

Data                                Size [bytes]   Cache lines   RAM [cycles]   Cache [cycles]

Startup
p1[v]                               64             2             112            4
p1[v+1]                             64             2             112            4
p1[v+2]                             64             2             112            4
p1[v+3]                             64             2             112            4

Steady state
p1[v] (reuse p1[v+2])               64             -             0              4
p1[v+1] (reuse p1[v+3])             64             -             0              4
p1[v+2]                             64             2             112            4
p1[v+3]                             64             2             112            4

Output
p2[v+1]                             56             2             176            4
p2[v+2]                             56             2             176            4

Data duplication

Data                                Size [bytes]   Cache lines   RAM [cycles]   Cache [cycles]

Startup
p1[v] (3 times)                     64             2             112            12
p1[v+1] (3 times)                   64             2             112            12
p1[v+2] (3 times)                   64             2             112            12
p1[v+3] (3 times)                   64             2             112            12

Steady state
p1[v] (reuse p1[v+2], 3 times)      64             -             0              12
p1[v+1] (reuse p1[v+3], 3 times)    64             -             0              12
p1[v+2] (3 times)                   64             2             112            12
p1[v+3] (3 times)                   64             2             112            12

Output
p2[v+1]                             56             2             64             4
p2[v+2]                             56             2             64             4



The configurations representing the loop bodies are hand coded in NML and mapped and simulated with
[...]. The simulation yields, scaled to the cache frequency, 124 and 144 cycles, respectively. [...]
The comparison of the configurations with the reference system yields the results in the following
table. The first two rows of a section list the startup state and the steady state of the v-loop, which
has a trip count of 7. The rows labeled sum calculate to startup state + 7 * steady state. [...]
performance, as the configuration preload cannot be hidden and [...].

The results show the dominance of the configuration preload. Although the core performance of the
case using data duplication is worse than the case using shift register synthesis, [...] the values
including the memory hierarchy. The next table assumes that the configuration preload can be issued
early enough, so that it can be hidden and need not be taken into account.

The results again show the impact of the configuration preload for configurations that calculate small
or medium amounts of data. When it can be hidden, performance is almost doubled in this example.

The comparison to the reference system shows less improvement compared to other examples. The
reason is the short vector length. Nevertheless pictures of size 16x16 are [...], which embeds the
[...].

The final utilization is shown in the next table. As the estimations did not account for counters and
other controlling networks, the values for BREGs and FREGs differ significantly.



Parameter                       Value (shift register synthesis)   Value (data duplication)

Vector length                   2*16                               2*16
Reused data set size            48                                 48
I/O IRAMs [sum - pct]           6 - 38%                            14 - 88%
ALU [sum - pct]                 33 - 52%                           19 - 30%
BREG [def/route/sum - pct]      34/14/58 - 73%                     36/20/56 - 70%
FREG [def/route/sum - pct]      25/27/52 - 65%                     9/38/47 - 59%
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5.4.6 Parameterized Function 



Source code

The benchmark source code is not very likely to be written in that form in the real world. Normally
it would be encapsulated in a function with parameters for input and output arrays along
with the sizes of the picture to work on.

Therefore the source code would look similar to:

void edge3x3(int *p1, int *p2, int HORLEN, int VERLEN)
{
    int v, h, htmp, vtmp, sum;

    for(v=0; v<=VERLEN-3; v++) {
        for(h=0; h<=HORLEN-3; h++) {
            htmp = (*(p1 + (v+2) * HORLEN + h)   - *(p1 + v * HORLEN + h))   +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + v * HORLEN + h+2)) +
               2 * (*(p1 + (v+2) * HORLEN + h+1) - *(p1 + v * HORLEN + h+1));
            if (htmp < 0)
                htmp = -htmp;
            vtmp = (*(p1 + v * HORLEN + h+2)     - *(p1 + v * HORLEN + h))     +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + (v+2) * HORLEN + h)) +
               2 * (*(p1 + (v+1) * HORLEN + h+2) - *(p1 + (v+1) * HORLEN + h));
            if (vtmp < 0)
                vtmp = -vtmp;
            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            *(p2 + (v+1) * HORLEN + h+1) = sum;
        }
    }
}



5.4.7 Transformations

The transformations are the same as for the simple example in the sections above, apart from some
additional features.

Figure 60 A sample picture with the size 640 x 480 pixels. Without precautions loop tiling
would miss the pixels on the borders between the tiles.

The loop nest reads then as follows. We show only the variant with shift register synthesis, with the
loop body omitted for better reading. As stated above, the tile size is 128 (IRAM size), but the tile
advancing loops increase by 125, overlapping the tiles correctly. The loop body equals the one in 5.4.4
(Shift Register Synthesis).

for (v=0; v <= VERLEN-3; v += 125)
    for (h=0; h <= HORLEN-3; h += 125)
        for (vv=v; vv < min(v+127, VERLEN-2); vv += 2)
            for (hh=h; hh < min(h+127, HORLEN-2); hh++) {
                // loop body as in 5.4.4
            }
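Why the tiles have to overlap can be checked in one dimension with a few lines of C. The sketch below makes the following assumptions: a tile of w input columns starting at column h produces the output columns h+1 to h+w-2 (the centers of all complete 3x3 windows), and the last tile is clipped at the picture border. The function name and its parameters are hypothetical.

```c
#include <stdbool.h>

#define TILE 128   /* tile width in input columns (IRAM size) */

/*
 * Returns true if every interior column 1 .. len-2 of a row of `len` pixels
 * is produced by at least one tile when the tiles advance by `step` columns.
 */
bool tiles_cover_interior(int len, int step)
{
    for (int col = 1; col <= len - 2; col++) {
        bool covered = false;
        for (int h = 0; h <= len - 3 && !covered; h += step) {
            int width = (len - h < TILE) ? len - h : TILE;  /* clip last tile */
            /* outputs of this tile: columns h+1 .. h+width-2 */
            if (col >= h + 1 && col <= h + width - 2)
                covered = true;
        }
        if (!covered)
            return false;
    }
    return true;
}
```

Under these assumptions an advance of 125 covers all interior columns of a 640 or 480 pixel wide row, while advancing by the full tile size of 128 leaves the columns at the tile borders uncomputed.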



5.4.8 Final Code 

In addition to the simple variant, the final tile size of the innermost loop has to be passed to the array.
Therefore the RISC code reads as follows, where the body of the guarded first iteration for odd tile
sizes is omitted for simplicity.

XppPreloadConfig(XppCfg_edge3x3);

for (v=0; v <= VERLEN-3; v += 125)
    for (h=0; h <= HORLEN-3; h += 125) {
        v_tilesize = min(128, VERLEN - v);
        if ((v_tilesize & 1) != 0) {
            // calculate line on RISC
            v++; v_tilesize &= ~1;
        }
        for (vv=v; vv < v + v_tilesize; vv += 2) {
            tilesize = min(128, HORLEN-h);
            XppPreload(0, &p1[vv][h], tilesize);
            XppPreload(1, &p1[vv+1][h], tilesize);
            XppPreload(2, &p1[vv+2][h], tilesize);
            XppPreload(3, &p1[vv+3][h], tilesize);
            XppPreloadClean(4, &p2[vv+1][h+1], tilesize - 2);
            XppPreloadClean(5, &p2[vv+2][h+1], tilesize - 2);
            XppPreload(6, &tilesize, 1);
            XppExecute();
        }
    }

The configuration reads then:

void XppCfg_edge3x3() {
    // IRAMs
    int iram0[128];  // p1[vv]
    int iram1[128];  // p1[vv+1]
    int iram2[128];  // p1[vv+2]
    int iram3[128];  // p1[vv+3]
    int iram4[128];  // p2[vv+1][h+1]
    int iram5[128];  // p2[vv+2][h+1]
    int iram6[128];  // tilesize

    for(h=0; h<iram6[0]; h++) {
        // fill shift registers
        if (h>1) { tmp00 = tmp01; tmp10 = tmp11; tmp20 = tmp21;
                   tmp30 = tmp31; }
        if (h>0) { tmp01 = tmp02; tmp11 = tmp12; tmp21 = tmp22;
                   tmp31 = tmp32; }
        tmp02 = iram0[h]; tmp12 = iram1[h]; tmp22 = iram2[h];
        tmp32 = iram3[h];
        if (h>2) {
            htmp0 = 2 * (tmp21 - tmp01) +
                        (tmp20 - tmp00) +
                        (tmp22 - tmp02);
            htmp0 = abs(htmp0);
            vtmp0 = 2 * (tmp12 - tmp10) +
                        (tmp02 - tmp00) +
                        (tmp22 - tmp20);
            vtmp0 = abs(vtmp0);
            sum0 = min(255, htmp0 + vtmp0);
            iram4[h-1] = sum0;

            htmp1 = 2 * (tmp31 - tmp11) +
                        (tmp30 - tmp10) +
                        (tmp32 - tmp12);
            htmp1 = abs(htmp1);
            vtmp1 = 2 * (tmp22 - tmp20) +
                        (tmp12 - tmp10) +
                        (tmp32 - tmp30);
            vtmp1 = abs(vtmp1);
            sum1 = min(255, htmp1 + vtmp1);
            iram5[h-1] = sum1;
        }
    }
}



The estimated utilization and worst-case performance (full tile) is shown below.

Parameter               Value

Vector length           2 * 128
Reused data set size    384
I/O IRAMs               5 I + 2 O = 7
ALU                     2*8 + 4*2 = 24
BREG                    20
FREG                    4*2 = 8
Dataflow graph width    12
Dataflow graph height   3 (shift registers) + 8 (calculation)
Configuration cycles    11+128 = 139



5.4.9 Performance Evaluation 



For the performance evaluation we assume a 750 x [...] pixel picture similar to that shown in Figure 60.
We chose these dimensions to simplify measurements, since they are both multiples of 125. The
estimated data transfer performance is shown in the table below.

[Table: data transfer and execution cycles for the rows edge3x3 startup, edge3x3 steady and sum. Only
fragments are legible, among them the RAM / DCache values 3548 (startup) and 2816 (steady).]

The final utilization is shown in the following table. As mentioned above, the big differences
for FREGs and BREGs stem from the missing estimations for counter and controlling PAEs.



Parameter                       Value

Vector length                   2*128
Reused data set size            384
I/O IRAMs [sum - pct]           7 - 44%
ALU [sum - pct]                 27 - 43%
BREG [def/route/sum - pct]      41/21/62 - 78%
FREG [def/route/sum - pct]      25/34/59 - 74%
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5.5 FIR Filter 

5.5.1 Original Code 

Source code: 

#define N 256 
#define M 8 

int x[N], y[N]; 

const int c[M] = { 2, 4, 4, 2, 0, 7, -5, 2 }; 

main() {
    int i, j;

    /* code for loading x */

    for (i = 0; i < N-M+1; i++) {
        y[i] = 0;                          // S
        for (j = 0; j < M; j++)
            y[i] += c[j] * x[i+M-j-1];     // S'
    }

    /* code for storing y */
}

The constants N and M are replaced by their values by the pre-processor. The data dependence graph
is the following:

[Figure: data dependence graph of the statements S and S']

for (i = 0; i < 249; i++) {
S:      y[i] = 0;
        for (j = 0; j < 8; j++)
S':         y[i] += c[j] * x[i+7-j];
}

We have the following table:

Parameter               Value

Vector length           input: 8, output: 1
Reused data set size    -
I/O IRAMs               3
ALU                     2
BREG                    0
FREG                    0
Dataflow graph width    1
Dataflow graph height   2
Configuration cycles    2+8=10



Increasing the amount of parallelism available in a loop body tends to increase the amount of memory
needed by the optimized loop body. In this case, the maximal parallelism is obtained when all
multiplications of the inner loop are done in parallel, and the inner loop is completely unrolled. This
way, 8 elements of array x are needed at each cycle. This is only possible by using data duplication,
which means that all 16 IRAMs (2 IRAMs for each copy of array x) are needed to store array x, and
consequently array y has to be output directly on the output ports. [...]

The latter is possible in this case as array y is a global variable, but it would not be possible if it were
[...]. Moreover, as the same data must be loaded into the different IRAMs from the cache for array x,
we have a lot of transfers to achieve before the configuration can begin the computations. The
performance of this algorithm is bounded by memory access times and thus there is no need to
maximize parallelism. For this reason, the solution chosen by the compiler does not extract the
maximal parallelism, in order to release the pressure on the memory hierarchy. Nevertheless the more
parallel solution is also presented to have a point of comparison.



5.5.2 Solution chosen by the compiler 



To find some parallelism in the inner loop, the straightforward solution is to unroll the inner loop. No
other transformations are applied before, as either they do not have an effect on the loop or they increase
the need for IRAMs. After loop unrolling, we obtain the following code:

for (i = 0; i < 249; i++) {
    y[i] = 0;
    y[i] += c[0] * x[i+7];
    y[i] += c[1] * x[i+6];
    y[i] += c[2] * x[i+5];
    y[i] += c[3] * x[i+4];
    y[i] += c[4] * x[i+3];
    y[i] += c[5] * x[i+2];
    y[i] += c[6] * x[i+1];
    y[i] += c[7] * x[i];
}



Then the parameter table looks like this:

Parameter               Value

Vector length           input: 256, output: 249
Reused data set size    -
I/O IRAMs               5
ALU                     16
BREG                    0
FREG                    0
Dataflow graph width    2
Dataflow graph height   9
Configuration cycles    9+249=258



Dataflow analysis reveals that y[0] = f(x[0], ..., x[7]), y[1] = f(x[1], ..., x[8]), ...,
y[i] = f(x[i], ..., x[i+7]). Successive values of y depend on almost the same successive values of x.
To reduce the number of accesses to the IRAMs, the values of x needed for the computation of the next
values of y are kept in registers. In our case this shift register synthesis needs 7 registers. This will be
achieved on the PACT XPP by [...]. [...] the size of the array is smaller than the total amount of
memory available on the PACT XPP. The code becomes the following after shift register synthesis:

c0 = c[0];
c1 = c[1];
c2 = c[2];
c3 = c[3];
c4 = c[4];
c5 = c[5];
c6 = c[6];
c7 = c[7];

r0 = x[0];
r1 = x[1];
r2 = x[2];
r3 = x[3];
r4 = x[4];
r5 = x[5];
r6 = x[6];
r7 = x[7];

for (i = 0; i < 249; i++) {
    y[i] = c7*r0 + c6*r1 + c5*r2 + c4*r3 + c3*r4 + c2*r5 + c1*r6 + c0*r7;
    r0 = r1;
    r1 = r2;
    r2 = r3;
    r3 = r4;
    r4 = r5;
    r5 = r6;
    r6 = r7;
    r7 = x[i+8];   // value for the next iteration
}
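That the shift register form computes the same filter output as the original loop nest can be checked with the following sketch. It differs from the listing above in two labeled respects: the registers r0 to r7 are held in a small array r[], and the load of the next input value is guarded so that no element beyond the end of x is read. The function names are hypothetical.

```c
#define N 256
#define M 8

static const int c[M] = { 2, 4, 4, 2, 0, 7, -5, 2 };

/* Reference: the original FIR loop nest. */
void fir_reference(const int x[N], int y[N])
{
    for (int i = 0; i < N - M + 1; i++) {
        y[i] = 0;
        for (int j = 0; j < M; j++)
            y[i] += c[j] * x[i + M - j - 1];
    }
}

/* Shift register form: each element of x is loaded exactly once. */
void fir_shift_register(const int x[N], int y[N])
{
    int r[M];                                /* r[k] holds x[i+k]            */
    for (int k = 0; k < M; k++)
        r[k] = x[k];

    for (int i = 0; i < N - M + 1; i++) {
        int acc = 0;
        for (int k = 0; k < M; k++)          /* c7*r0 + c6*r1 + ... + c0*r7  */
            acc += c[M - 1 - k] * r[k];
        y[i] = acc;

        for (int k = 0; k < M - 1; k++)      /* shift by one element         */
            r[k] = r[k + 1];
        if (i + M < N)                       /* guard the last iteration     */
            r[M - 1] = x[i + M];
    }
}
```

Only the first N-M+1 = 249 elements of y are written by either function, so a comparison should be restricted to that range.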



int *piram0_1, *piram1_1;

piram0_1 = &x1[0];
piram1_1 = &y1[0];

for (i = 0; i < 249; i++) {
    r0 = r1;
    r1 = r2;
    r2 = r3;
    r3 = r4;
    r4 = r5;
    r5 = r6;
    r6 = r7;
    r7 = x1++;

    if (i < 128)
        piram0_1++ = x2++;
    else if (i == 128)
        x1 = &x1[0];

    y1++ = c7*r0 + c6*r1 + c5*r2 + c4*r3 + c3*r4 +
           c2*r5 + c1*r6 + c0*r7;

    if (i < 128)
        y2++ = piram1_1++;
    else if (i == 128)
        y1 = &y1[0];
}

The dataflow graph representing the loop body is shown below. 




The final parameter table is shown below. 
Parameter                Value
Vector length            input: 256, output: 249
Reused data set size     -
I/O IRAMs                4
ALU                      15
BREG                     0
FREG                     7
Dataflow graph width     3
Dataflow graph height    9
Configuration cycles     9+249=258



Variant with Larger Loop Bounds 

Let us take larger loop bounds and set the values of N and M to 2048 and 64.

for (i =0; i < 1985; i++) { 
y[i] = 0; 

for (j = 0; j < 64; j++) 
y[i] += c[j] * x[i+63-j] ; 

} 



The loop nest needs 17 IRAMs for the three arrays, which makes it impossible to execute on the
PACT XPP. Following the loop optimizations driver given before, we apply loop tiling to reduce the
number of IRAMs needed by the arrays, and the number of resources needed by the inner loop. We
use a size of 512 for x and y, and 16 for c. Theoretically, we could have taken bigger sizes and occupied
more IRAMs, but subsequent optimizations will need more IRAMs. This can already be stated, as the
amount of parallelism in the innermost loop is low, and more resources will be needed to increase it;
therefore we must take smaller sizes. We obtain the following loop nest, where only 9 IRAMs are
needed for the loop nest at the second level.

for (ii = 0; ii < 4; ii++)
    for (i = 0; i < min(512, 1985-ii*512); i++) {
        y[i+512*ii] = 0;
        for (jj = 0; jj < 4; jj++)
            for (j = 0; j < 16; j++)
                y[i+512*ii] += c[16*jj+j] * x[i+512*ii+63-16*jj-j];
    }
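The index arithmetic of the tiled loop nest is easy to get wrong, so it can help to compare it with the untiled loop. The sketch below is an illustration only: the helper names, the test data and the way the tile count is derived are assumptions, while the tile sizes 512 and 16 and the bounds 2048 and 64 are the ones used in the text.

```c
#include <assert.h>

#define N   2048
#define M   64
#define OUT (N - M + 1)    /* 1985 output samples */
#define TI  512            /* tile size used for x and y */
#define TJ  16             /* tile size used for c */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Original loop nest. */
static void fir_plain(const int *c, const int *x, int *y)
{
    for (int i = 0; i < OUT; i++) {
        y[i] = 0;
        for (int j = 0; j < M; j++)
            y[i] += c[j] * x[i + M - 1 - j];
    }
}

/* Tiled loop nest: ii and jj select a tile, i and j run inside it. */
static void fir_tiled(const int *c, const int *x, int *y)
{
    assert(M % TJ == 0);
    for (int ii = 0; ii * TI < OUT; ii++)               /* 4 tiles for OUT = 1985 */
        for (int i = 0; i < min_int(TI, OUT - ii * TI); i++) {
            y[i + TI * ii] = 0;
            for (int jj = 0; jj < M / TJ; jj++)
                for (int j = 0; j < TJ; j++)
                    y[i + TI * ii] += c[TJ * jj + j]
                                    * x[i + TI * ii + (M - 1) - TJ * jj - j];
        }
}

/* Returns 1 if the tiled nest reproduces the original result. */
int tiled_fir_agrees(void)
{
    static int c[M], x[N], y_plain[OUT], y_tiled[OUT];
    for (int j = 0; j < M; j++)
        c[j] = (j % 7) - 3;                 /* arbitrary test coefficients */
    for (int i = 0; i < N; i++)
        x[i] = (i * 13 + 5) % 97 - 48;      /* deterministic test signal */
    fir_plain(c, x, y_plain);
    fir_tiled(c, x, y_tiled);
    for (int i = 0; i < OUT; i++)
        if (y_plain[i] != y_tiled[i])
            return 0;
    return 1;
}
```

The last tile is shorter than the others (449 instead of 512 iterations), which is exactly what the min() in the loop bound accounts for.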

A subsequent application of loop unrolling on the inner loop yields: 

for (ii = 0; ii < 4; ii++)
    for (i = 0; i < min(512, 1985-ii*512); i++) {
        y[i+512*ii] = 0;
        for (jj = 0; jj < 4; jj++) {
            y[i+512*ii] += c[16*jj]    * x[i+512*ii+63-16*jj];
            y[i+512*ii] += c[16*jj+1]  * x[i+512*ii+62-16*jj];
            y[i+512*ii] += c[16*jj+2]  * x[i+512*ii+61-16*jj];
            y[i+512*ii] += c[16*jj+3]  * x[i+512*ii+60-16*jj];
            y[i+512*ii] += c[16*jj+4]  * x[i+512*ii+59-16*jj];
            y[i+512*ii] += c[16*jj+5]  * x[i+512*ii+58-16*jj];
            y[i+512*ii] += c[16*jj+6]  * x[i+512*ii+57-16*jj];
            y[i+512*ii] += c[16*jj+7]  * x[i+512*ii+56-16*jj];
            y[i+512*ii] += c[16*jj+8]  * x[i+512*ii+55-16*jj];
            y[i+512*ii] += c[16*jj+9]  * x[i+512*ii+54-16*jj];
            y[i+512*ii] += c[16*jj+10] * x[i+512*ii+53-16*jj];
            y[i+512*ii] += c[16*jj+11] * x[i+512*ii+52-16*jj];
            y[i+512*ii] += c[16*jj+12] * x[i+512*ii+51-16*jj];
            y[i+512*ii] += c[16*jj+13] * x[i+512*ii+50-16*jj];
            y[i+512*ii] += c[16*jj+14] * x[i+512*ii+49-16*jj];
            y[i+512*ii] += c[16*jj+15] * x[i+512*ii+48-16*jj];
        }
    }



Finally we obtain the same dataflow graph as above, except that the coefficients must be read from
another IRAM rather than being directly handled like constants by the multiplications. After shift
register synthesis the code is the following:

for (ii = 0; ii < 4; ii++)
    for (i = 0; i < min(512, 1985-ii*512); i++) {
        r0 = x[i+512*ii+48];
        r1 = x[i+512*ii+49];
        r2 = x[i+512*ii+50];
        r3 = x[i+512*ii+51];
        r4 = x[i+512*ii+52];
        r5 = x[i+512*ii+53];
        r6 = x[i+512*ii+54];
        r7 = x[i+512*ii+55];
        r8 = x[i+512*ii+56];
        r9 = x[i+512*ii+57];
        r10 = x[i+512*ii+58];
        r11 = x[i+512*ii+59];
        r12 = x[i+512*ii+60];
        r13 = x[i+512*ii+61];
        r14 = x[i+512*ii+62];
        r15 = x[i+512*ii+63];
        for (jj = 0; jj < 4; jj++) {
            y[i] = c[8*jj]*r15 + c[8*jj+1]*r14 + c[8*jj+2]*r13 +
                   c[8*jj+3]*r12 + c[8*jj+4]*r11 + c[8*jj+5]*r10 +
                   c[8*jj+6]*r9 + c[8*jj+7]*r8 + c[8*jj+8]*r7 +
                   c[8*jj+9]*r6 + c[8*jj+10]*r5 + c[8*jj+11]*r4 +
                   c[8*jj+12]*r3 + c[8*jj+13]*r2 + c[8*jj+14]*r1 +
                   c[8*jj+15]*r0;
            r0 = r1;
            r1 = r2;
            r2 = r3;
            r3 = r4;
            r4 = r5;
            r5 = r6;
            r6 = r7;
            r7 = r8;
            r8 = r9;
            r9 = r10;
            r10 = r11;
            r11 = r12;
            r12 = r13;
            r13 = r14;
            r14 = r15;
            r15 = x[i+ ...
Parameter                Value
Vector length            input: 8, output: 1
Reused data set size     -
I/O IRAMs                3
ALU                      31
BREG                     0
FREG                     15
Dataflow graph width     3
Dataflow graph height    17
Configuration cycles     4+17=21



5.5.3 A More Parallel Solution 

The solution we presented does not expose maximal parallelism in the loop. This can be done by
explicitly parallelizing the loop before we generate the dataflow graph. Of course, as explained before,
exposing more parallelism means more pressure on the memory hierarchy.

In the data dependence graph presented at the beginning, the only loop-carried dependence is
caused solely by the reference to y[i]. Hence we apply node splitting to get a
more suitable data dependence graph, and a statement that can be parallelized. We then obtain:

for (i = 0; i < 249; i++) { 
y[i] = 0;

for (j = 0; j < 8; j++) 
{ 

tmp = c[j] * x[i+7-j] ; 
y[i] += tmp; 

} 

} 

Then scalar expansion is performed on tmp to remove the loop-carried anti-dependence caused by it,
and we have the following code:

for (i = 0; i < 249; i++) {
y[i] = 0;

for (j = 0; j < 8; j++) 
{ 

tmp[j] = c[j] * x[i+7-j]; 
y[i] += tmp[j]; 

} 

} 
Parameter                Value
Vector length            input: 8, output: 1
Reused data set size     -
I/O IRAMs                3
ALU                      2
BREG                     0
FREG                     1
Dataflow graph width     2
Dataflow graph height    2
Configuration cycles     2+8=10



Then we apply loop distribution to get a vectorizable and a not vectorizable loop. 

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];
    for (j = 0; j < 8; j++)
        y[i] += tmp[j];
}

The following table corresponds to the inner loops, in order to be compared with the preceding table.



Parameter                Value
Vector length            input: 8, output: 1
Reused data set size     -
I/O IRAMs                5
ALU                      2
BREG                     0
FREG                     1
Dataflow graph width     1
Dataflow graph height    3
Configuration cycles     1*8+1*8=16



We then have to take into account the architecture. The first loop is fully parallel; this means that we
[...] of the PACT XPP [...]. This is [...], as it corresponds to the number of IRAMs
[...]. We do not need to strip-mine the first inner loop. The case of the second
loop is trivial; it does not need to be strip-mined either. The second loop is a reduction: it computes the
[...]. This is handled by the reduction recognition optimization, and we obtain the
following code:
for (i = 0; i < 249; i++) {
    y[i] = 0;

    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];

    /* load the partial sums from memory using a shorter vector length */
    for (j = 0; j < 4; j++)
        aux[j] = tmp[2*j] + tmp[2*j+1];

    /* accumulate the short vector */
    for (j = 0; j < 2; j++)
        aux[2*j] = aux[2*j] + aux[2*j+1];

    /* sequence of scalar instructions to add up the partial sums */
    y[i] = aux[0] + aux[2];
}



We give only one table for all innermost loops and the last instruction computing y[i].



Parameter                Value
Vector length            input: 256, output: 249
Reused data set size     -
I/O IRAMs                9
ALU                      4
BREG                     0
FREG                     0
Dataflow graph width     1
Dataflow graph height    4
Configuration cycles     1*8+1*4+1*1=13



Finally loop unrolling is applied on the inner loops; the number of operations is always less than the
number of processing elements of the PACT XPP.



for (i = 0; i < 249; i++)
{
    tmp[0] = c[0] * x[i+7];
    tmp[1] = c[1] * x[i+6];
    tmp[2] = c[2] * x[i+5];
    tmp[3] = c[3] * x[i+4];
    tmp[4] = c[4] * x[i+3];
    tmp[5] = c[5] * x[i+2];
    tmp[6] = c[6] * x[i+1];
    tmp[7] = c[7] * x[i];
    aux[0] = tmp[0] + tmp[1];
    aux[1] = tmp[2] + tmp[3];
    aux[2] = tmp[4] + tmp[5];
    aux[3] = tmp[6] + tmp[7];
    aux[0] = aux[0] + aux[1];
    aux[2] = aux[2] + aux[3];
    y[i] = aux[0] + aux[2];
}
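The effect of reduction recognition can be illustrated by comparing the sequential accumulation with the pairwise (tree) reduction that results from it. The following sketch is an illustration under the assumption of integer data without overflow; the function names and test values are not taken from the study.

```c
#include <assert.h>

#define TAPS 8

/* Sequential accumulation: seven dependent additions in a chain. */
static int sum_sequential(const int *tmp)
{
    int acc = 0;
    for (int j = 0; j < TAPS; j++)
        acc += tmp[j];
    return acc;
}

/* Tree reduction: 4 + 2 + 1 additions in three layers, where the
   additions inside one layer are independent of each other. */
static int sum_tree(const int *tmp)
{
    int aux[TAPS / 2];
    for (int j = 0; j < 4; j++)
        aux[j] = tmp[2 * j] + tmp[2 * j + 1];
    for (int j = 0; j < 2; j++)
        aux[2 * j] = aux[2 * j] + aux[2 * j + 1];
    return aux[0] + aux[2];
}

/* Computes one output sample y[i] of the 8-tap filter in both ways
   and returns 1 if the results are identical. */
int reduction_agrees(const int *c, const int *x, int i)
{
    int tmp[TAPS];
    assert(c != 0 && x != 0 && i >= 0);
    for (int j = 0; j < TAPS; j++)
        tmp[j] = c[j] * x[i + TAPS - 1 - j];
    return sum_sequential(tmp) == sum_tree(tmp);
}
```

With one multiplication layer followed by the three addition layers, this gives the four layers of the dataflow graph; the seven additions plus eight multiplications account for 15 operations per iteration.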
We then obtain the following dataflow graph representing the inner loop.

[Figure: dataflow graph of the inner loop, with inputs x[i+7], x[i+6], x[i+5], x[i+4], x[i+3], x[i+2], x[i+1] and x[i]]




It could be mapped on the PACT XPP with each layer executed in parallel, thus needing 4
cycles/iteration and 15 ALU-PAEs, 8 of which are needed in parallel. As the graph is already
synchronized, the throughput reaches one iteration/cycle, after 4 cycles to fill the pipeline. The
coefficients are taken as constant inputs by the ALUs performing the multiplications.

The drawback of this solution is that it uses 16 IRAMs, and that the input data must be stored in a
special order. But due to the data locality of the program, we can assume that the data already reside in the
cache. And as the transfer of data from the cache to the IRAMs can be achieved efficiently, the
configuration is executed on the PACT XPP without waiting for the data to be ready in the IRAMs.
The parameter table is then the following:



Parameter                Value
Vector length            input: 256, output: 249
Reused data set size     -
I/O IRAMs                16
ALU                      15
BREG                     0
FREG                     0
Dataflow graph width     8
Dataflow graph height    4
Configuration cycles     4+249=253
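The "Configuration cycles" entries in the parameter tables follow the pattern "pipeline fill time plus one iteration per cycle". As a small worked example, the function below reproduces that arithmetic; it is a sketch that assumes a fully pipelined dataflow graph without stalls, and its name is chosen here for illustration.

```c
#include <assert.h>

/* Estimated cycles of one configuration run: the dataflow graph height
   (cycles needed to fill the pipeline) plus one cycle per iteration.
   Assumption: throughput of one iteration per cycle, no stalls. */
int configuration_cycles(int graph_height, int iterations)
{
    assert(graph_height >= 0 && iterations >= 0);
    return graph_height + iterations;
}
```

For the parallel solution this gives configuration_cycles(4, 249) = 253, and for the shift register solution configuration_cycles(9, 249) = 258, matching the two tables.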



Variant with Larger Bounds 

To make things a bit more interesting, we set the values of N and M to 2048 and 64.



for (i = 0; i < 1985; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}

The data dependence graph is the same as above. We then apply node splitting to get a more
convenient data dependence graph.



for (i = 0; i < 1985; i++) { 
y[i] = 0; 

for (j = 0; j < 64; j++) 
{ 

tmp = c[j] * x[i+63-j]; 
y[i] += tmp; 

} 

} 



After scalar expansion: 

for (i = 0; i < 1985; i++) { 
y[i] = 0; 

for (j = 0; j < 64; j++) 
{

tmp[j] = c[j] * x[i+63-j]; 
y[i] += tmp[j] ; 

} 

} 

After loop distribution: 

for (i = 0; i < 1985; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        tmp[j] = c[j] * x[i+63-j];
    for (j = 0; j < 64; j++)
        y[i] += tmp[j];
}

We go through the compiling process, and we arrive at the set of optimizations that depend upon
architectural parameters. We want to split the iteration space, as too many operations would have to be
performed in parallel if we kept it as such. Hence we perform strip-mining on the 2 loops. We can
only access 16 data at a time, so, because of the first loop, the factor will be 64*2/16 = 8 for the 2
loops (as we always have in mind that we want to execute both at the same time on the PACT XPP).

for (i = 0; i < 1985; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            y[i] += tmp[8*jj+j];
}

And then loop fusion on the jj loops is performed. 

for (i = 0; i < 1985; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++) {
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
        for (j = 0; j < 8; j++)
            y[i] += tmp[8*jj+j];
    }
}
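Strip-mining followed by loop fusion only reorders the work, so the fused nest must compute the same y[i] as the distributed loops. The sketch below checks this for a single output sample; it is illustrative only, with assumed helper names and test data, and with the strip length 8 derived in the text.

```c
#include <assert.h>

#define N     2048
#define M     64
#define OUT   (N - M + 1)   /* 1985 output samples */
#define STRIP 8             /* strip length: 64 * 2 / 16 */

/* After loop distribution: all products first, then the sum. */
static int tap_distributed(const int *c, const int *x, int i)
{
    int tmp[M], acc = 0;
    for (int j = 0; j < M; j++)
        tmp[j] = c[j] * x[i + M - 1 - j];
    for (int j = 0; j < M; j++)
        acc += tmp[j];
    return acc;
}

/* After strip-mining both loops and fusing the jj loops:
   eight products and eight additions per strip. */
static int tap_fused(const int *c, const int *x, int i)
{
    int tmp[M], acc = 0;
    assert(M % STRIP == 0);
    for (int jj = 0; jj < M / STRIP; jj++) {
        for (int j = 0; j < STRIP; j++)
            tmp[STRIP * jj + j] = c[STRIP * jj + j]
                                * x[i + M - 1 - STRIP * jj - j];
        for (int j = 0; j < STRIP; j++)
            acc += tmp[STRIP * jj + j];
    }
    return acc;
}

/* Returns 1 if both forms agree for every output sample. */
int strip_mined_fir_agrees(void)
{
    static int c[M], x[N];
    for (int j = 0; j < M; j++)
        c[j] = (j % 5) - 2;                 /* arbitrary test coefficients */
    for (int i = 0; i < N; i++)
        x[i] = (i * 7 + 3) % 61 - 30;       /* deterministic test signal */
    for (int i = 0; i < OUT; i++)
        if (tap_distributed(c, x, i) != tap_fused(c, x, i))
            return 0;
    return 1;
}
```

Fusion is legal here because each strip of the second loop only reads the tmp elements that the same strip of the first loop has just written.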
Now we apply reduction recognition on the second innermost loop.

for (i = 0; i < 1985; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
    {
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];

        /* load the partial sums from memory using a shorter vector length */
        for (j = 0; j < 4; j++)
            aux[j] = tmp[8*jj+2*j] + tmp[8*jj+2*j+1];

        /* accumulate the short vector */
        for (j = 0; j < 2; j++)
            aux[2*j] = aux[2*j] + aux[2*j+1];

        /* sequence of scalar instructions to add up the partial sums */
        y[i] += aux[0] + aux[2];
    }
}



And then loop unrolling. 



for (i = 0; i < 1985; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
    {
        tmp[8*jj]   = c[8*jj]   * x[i+63-8*jj];
        tmp[8*jj+1] = c[8*jj+1] * x[i+62-8*jj];
        tmp[8*jj+2] = c[8*jj+2] * x[i+61-8*jj];
        tmp[8*jj+3] = c[8*jj+3] * x[i+60-8*jj];
        tmp[8*jj+4] = c[8*jj+4] * x[i+59-8*jj];
        tmp[8*jj+5] = c[8*jj+5] * x[i+58-8*jj];
        tmp[8*jj+6] = c[8*jj+6] * x[i+57-8*jj];
        tmp[8*jj+7] = c[8*jj+7] * x[i+56-8*jj];

        aux[0] = tmp[8*jj]   + tmp[8*jj+1];
        aux[1] = tmp[8*jj+2] + tmp[8*jj+3];
        aux[2] = tmp[8*jj+4] + tmp[8*jj+5];
        aux[3] = tmp[8*jj+6] + tmp[8*jj+7];
        aux[0] = aux[0] + aux[1];
        aux[2] = aux[2] + aux[3];
        y[i] += aux[0] + aux[2];
    }
}



We implement the innermost loop on the PACT XPP directly with a counter. The IRAMs are used in
[...] mode and filled according to the addresses of the arrays in the loop. IRAM0, IRAM2, IRAM4,
IRAM6 and IRAM8 contain array c. IRAM1, IRAM3, IRAM5 and IRAM7 contain array x. Array c
contains 64 elements, that is, each IRAM contains 8 elements. Array x contains 1024 elements, that is
128 elements for each IRAM. Array y is directly written to memory, as it is a global array and its
address is constant. This constant is used to initialize the address counter of the configuration. The
final parameter table is the following:
Parameter                Value
Vector length            input: 8, output: 1
Reused data set size     -
I/O IRAMs                16
ALU                      15
BREG                     0
FREG                     0
Dataflow graph width     8
Dataflow graph height    4
Configuration cycles     4+8=12



Nevertheless it should be noted that this version should be less efficient than the previous one. As the
same data must be loaded into the different IRAMs from the cache, we have a lot of transfers to achieve
before the configuration can begin the computations. This overhead must be taken into account by the
compiler when choosing the code generation strategy. As already stated, this means that the first
solution is the solution that will be chosen by the compiler.



5.5.4 Final Code 

int x[256], y[256];

const int c[8] = { 2, 4, 4, 2, 0, 7, -5, 2 };

main()
{
    XppPreloadConfig(XppCfg_fir);
    XppPreload(0, x, 128);
    XppPreload(1, x + 128, 128);
    XppExecute();
    XppSync(y, 249);
}



void XppCfg_fir() {
    // Input IRAMs
    int iram0_1[128], iram0_2[128];
    // Output IRAMs
    int iram1_1[128], iram1_2[128];
    int *piram0_1, *piram1_1;

    piram0_1 = &iram0_1[0];
    piram1_1 = &iram1_1[0];

    for (i = 0; i < 249; i++)
    {
        r0 = r1;
        r1 = r2;
        r2 = r3;
        r3 = r4;
        r4 = r5;
        r5 = r6;
        r6 = r7;
        r7 = iram0_1++;

        if (i < 128)
            piram0_1++ = iram0_2++;
        else
            if (i == 128)
                iram0_1 = &iram0_1[0];

        iram1_1++ = c7*r0 + c6*r1 + c5*r2 + c4*r3 + c3*r4 +
                    c2*r5 + c1*r6 + c0*r7;

        if (i < 128)
            iram1_2++ = piram1_1++;
        else
            if (i == 128)
                iram1_1 = &iram1_1[0];
    }
}



5.5.5 Performance Evaluation 

[...] arrays causes no cache [...].

[...] dual-issue 32-bit fixed-point [...] reference system, can be neglected for the
comparison. It is assumed that the DSP can perform a load and a memory store in one cycle.

The table below lists the performance data [...].

[Table: performance data for the startup case and the steady state; apart from the value 2464, the entries are not legible in the source.]

[...] implemented so that configuration loading and configuration execution do not overlap.

The final utilization of the resources is shown in the following table. The information is taken from the
[...] generated from the [...] source code by the [...] tool. The difference concerning the
[...] between this table and the final parameter table presented before resides in the fact
that additions can be executed either by ALUs or BREGs. In the former parameter table the additions
were meant to be executed by ALUs, whereas in the NML code they are mapped to BREGs.



Parameter                      Value
Vector length                  read: 256, write: 249
Reused data set size           -
I/O IRAMs [sum - pct]          4 - 25%
ALU [sum - pct]                10 - 16%
BREG [def/route/sum - pct]     15/2/17 - 21%
FREG [def/route/sum - pct]     16/3/19 - 24%



[...] in a loop. Below is sketched how different iterations can
overlap. First the configuration itself is loaded (Ld Config), then the data needed is loaded
(Ld Iteration 1). The configuration is then executed (Ex Iteration 1), and the write-back phase (WB
Iteration 1) takes place. The steady state is contained in the orange box. It is the kernel of the loop and
contains phases of four different iterations. After the kernel has been executed (n-[...]) times, n being the
number of iterations of the loop, the remaining phases are executed.






WO 2005/010632 



PCT/EP2004/006547



[Figure: timing diagram of the overlapping iterations. Columns: RAM, Cache (ICache), XPP, Cache (DCache), IRAMs, Execute. Vertical axis: cycles, from 0 to 14000. Phases shown: Ld, Ex and WB of Iterations 1 to 4.]


5.5.6 Other Variant 



Source Code 

for (i = 0; i < N-M+1; i++) {
    tmp = 0;
    for (j = 0; j < M; j++)
        tmp += c[j] * x[i+M-j-1];
    x[i] = tmp;
}

In this case, it is trivial that the data dependence graph is cyclic due to dependences on tmp. Therefore
scalar expansion is applied on the loop, and we obtain in fact the same code as the first version of the
FIR filter, as shown below.

for (i = 0; i < N-M+1; i++) {
    tmp[i] = 0;
    for (j = 0; j < M; j++)
        tmp[i] += c[j] * x[i+M-j-1];
    x[i] = tmp[i];
}
5.6 Matrix Multiplication



5.6.1 Original Code 

Source code: 

#define L 10
#define M 15
#define N 20

int A[L][M];
int B[M][N];
int R[L][N];

main() {
    int i, j, k, tmp, aux;

    /* input A (L*M values) */
    for(i=0; i<L; i++)
        for(j=0; j<M; j++)
            scanf("%d", &A[i][j]);

    /* input B (M*N values) */
    for(i=0; i<M; i++)
        for(j=0; j<N; j++)
            scanf("%d", &B[i][j]);

    /* multiply */
    for(i=0; i<L; i++)
        for(j=0; j<N; j++) {
            aux = 0;
            for(k=0; k<M; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }

    /* write data stream */
    for(i=0; i<L; i++)
        for(j=0; j<N; j++)
            printf("%d\n", R[i][j]);
}

5.6.2 Preliminary Transformations 

Since no function call is candidate for inlining, no interprocedural code movement is done. 

Of the four loop nests the third one is the only candidate for running partly on the XPP. All others
have function calls in the loop body and are therefore discarded as candidates very early during the
compilation process.
for(i=0; i<L; i++)
    for(j=0; j<N; j++) {
        aux = 0;                            /* S1 */
        for(k=0; k<M; k++)
            aux += A[i][k] * B[k][j];       /* S2 */
        R[i][j] = aux;                      /* S3 */
    }

Figure 61: Data dependence graph for matrix multiplication



The data dependence graph shows no dependence that prevents pipeline vectorization. The
loop-carried true dependence from S2 to itself can be handled by a feedback of aux as described in [1].

To get a perfect loop nest we move S1 and S3 inside the k-loop. Therefore appropriate guards are
generated to protect the assignments. The code after this transformation looks like
for(i=0; i<L; i++)
    for(j=0; j<N; j++)
        for(k=0; k<M; k++) {
            if (k == 0) aux = 0;
            aux += A[i][k] * B[k][j];
            if (k == M-1) R[i][j] = aux;
        }

Our goal is to interchange the loop nests to improve the array accesses and make the best use of the cache.
Unfortunately the guarded statements involving aux cause backward loop-carried anti-dependences
carried by the j-loop. Scalar expansion will break these dependences, allowing loop interchange.

for(i=0; i<L; i++)
    for(j=0; j<N; j++)
        for(k=0; k<M; k++) {
            if (k == 0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k == M-1) R[i][j] = aux[j];
        }

5.6.3 Loop Interchange for Cache Reuse 

Figure 62 shows the iteration spaces for the array accesses in the main loop. Since arrays in C are
placed in row major order, the cache lines are placed in the array rows. At first sight there seems to be
no need for optimization, because the algorithm requires at least one array access to stride over a
column. Nevertheless this assumption misses the fact that the access rate is of interest, too. Closer
examination shows that array R is accessed in every j iteration, while array B is accessed at each
iteration of the k-loop, which is very likely to produce a cache miss. This leaves a possibility for loop
interchange to improve cache access as proposed by Kennedy and Allen in [7].
Figure 62: The visualized array access sequences.

[...] The cost term is a constant for known loop bounds, or a function of the loop bound for
unknown loop bounds. The term is calculated in three steps.

• First the cost of each reference in the innermost loop body is calculated. It is equal to:
  - 1, if the reference does not depend on the loop induction variable of the (current) innermost
    loop,
  - the loop count, if the reference depends on the loop induction variable and strides over a
    non-contiguous area with respect to the cache layout,
  - (loop count · s) / b, if the reference depends on the loop induction variable and strides over a
    contiguous dense area, where s is the step size and b is the cache line size.
• Second each reference cost is weighted with a factor for each other loop, which is
  - 1, if the reference does not depend on the loop index,
  - the loop count, if the reference depends on the loop index.
• Third the overall loop nest cost is calculated by summing the costs of all references.
Innermost loop         k                       i                    j
R[i][j]                1·L·N                   L·N                  (N/b)·L
A[i][k]                (M/b)·L                 L·M                  1·L·M
B[k][j]                M·N                     1·M·N                (N/b)·M
Memory access cost     L·N + (M/b)·L + M·N     L·N + L·M + M·N      (N/b)·(L+M) + L·M
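The three memory access cost terms can be evaluated numerically to see which loop should become the innermost one. The sketch below does this for the matrix sizes of the example; the function and type names are chosen for illustration, and the cache line size b = 8 elements used in the usage note is an assumed value, since the study does not fix it here.

```c
#include <assert.h>

/* Memory access cost of the loop nest for each candidate innermost loop. */
typedef struct {
    double k_inner;   /* k-loop innermost */
    double i_inner;   /* i-loop innermost */
    double j_inner;   /* j-loop innermost */
} nest_cost;

/* Evaluates the cost terms for references R[i][j], A[i][k] and B[k][j].
   L, M, N are the loop counts; b is the cache line size in array
   elements (assumed parameter, step size s = 1). */
nest_cost matmul_nest_cost(double L, double M, double N, double b)
{
    nest_cost c;
    assert(b >= 1.0);
    c.k_inner = L * N + (M / b) * L + M * N;
    c.i_inner = L * N + L * M + M * N;
    c.j_inner = (N / b) * (L + M) + L * M;
    return c;
}
```

For L = 10, M = 15, N = 20 and an assumed b = 8, the costs are 518.75 with k innermost, 650 with i innermost and 212.5 with j innermost, so the j-loop is the cheapest choice for the innermost position under this model.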



for(i=0; i<L; i++)
    for(k=0; k<M; k++)
        for(j=0; j<N; j++) {
            if (k == 0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k == M-1) R[i][j] = aux[j];
        }

This loop order optimizes the cache-hit rate, thus improving the overall performance.
5.6.4 Enhancing parallelism

After improving the cache access behavior, the possibility for reduction recognition has been
destroyed. This is a typical example of transformations where one excludes the other. Fully unrolling
the inner loop is not applicable due to the number of available IRAMs. Therefore we try to
unroll-and-jam the two innermost loops.

Unroll-and-Jam 

We unroll the outer loop partially with the unrolling degree u. This factor is computed as the
minimum of two calculations:

• u_IRAM = IRAMs available / IRAMs needed
• u_PAE = PAEs available / PAEs needed

In this example the accesses to A and B depend on k (the loop which will be unrolled). Therefore they
must be considered in the calculation. The accesses to aux and R do not depend on k. Thus they can be
subtracted from the available IRAMs, but do not need to be added to the denominator. Therefore we
calculate u_IRAM = 14/2 = 7.

On the other hand the loop body involves two ALU operations (1 add, 1 mult), which yields
u_PAE = 64/2 = 32 (see footnote 2).

The constraint generated by the IRAMs therefore dominates by far, as
u = min(7, 32) = 7.

To keep the complexity of the configuration simple, we choose an unrolling degree

u_final = loop count / ⌈loop count / u⌉ = 15 / ⌈15/7⌉ = 5
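The unrolling degree calculation above can be written down as a small function. This is a sketch of the arithmetic only: the function name and parameter names are assumptions, and the rounding (integer division for the two ratios, a ceiling for the number of remaining loop trips) is inferred from the numbers in the example.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Unrolling degree for unroll-and-jam.
   irams_avail / paes_avail:   resources left for the unrolled body,
   irams_needed / paes_needed: resources per copy of the loop body,
   loop_count:                 trip count of the loop to be unrolled. */
int unroll_degree(int irams_avail, int irams_needed,
                  int paes_avail, int paes_needed, int loop_count)
{
    assert(irams_needed > 0 && paes_needed > 0 && loop_count > 0);
    int u_iram = irams_avail / irams_needed;
    int u_pae  = paes_avail / paes_needed;
    int u      = min_int(u_iram, u_pae);
    assert(u > 0);
    int trips  = (loop_count + u - 1) / u;   /* ceil(loop_count / u) */
    /* Spread the iterations evenly over the trips. If loop_count is not
       divisible by trips, the remainder needs separate handling. */
    return loop_count / trips;
}
```

With the values of the example, unroll_degree(14, 2, 64, 2, 15) yields u = min(7, 32) = 7, three trips of the k-loop, and therefore a final unrolling degree of 5.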



The code after this transformation then reads: 

for(i=0; i<L; i++) {
    for(k=0; k<M; k+=5) {
        for(j=0; j<N; j++) {
            if (k == 0) aux[j] = 0;
            aux[j] += A[i][k]   * B[k][j];
            aux[j] += A[i][k+1] * B[k+1][j];
            aux[j] += A[i][k+2] * B[k+2][j];
            aux[j] += A[i][k+3] * B[k+3][j];
            aux[j] += A[i][k+4] * B[k+4][j];
            if (k == 10) R[i][j] = aux[j];
        }
    }
}
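Loop interchange with scalar expansion and the subsequent unroll-and-jam must leave the result matrix unchanged. The following sketch compares the transformed loop nest with the original multiplication; it is an illustration with assumed function names and test data, not the code generated for the XPP.

```c
#include <assert.h>

#define L 10
#define M 15
#define N 20
#define U 5    /* unrolling degree of the k-loop */

/* Original loop nest with a scalar accumulator. */
static void matmul_ref(int A[L][M], int B[M][N], int R[L][N])
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            int aux = 0;
            for (int k = 0; k < M; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }
}

/* Interchanged loops (j innermost), aux expanded to an array,
   k-loop unrolled by U and jammed into the j-loop. */
static void matmul_unroll_and_jam(int A[L][M], int B[M][N], int R[L][N])
{
    int aux[N];
    assert(U == 5 && M % U == 0);
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k += U)
            for (int j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                aux[j] += A[i][k]     * B[k][j];
                aux[j] += A[i][k + 1] * B[k + 1][j];
                aux[j] += A[i][k + 2] * B[k + 2][j];
                aux[j] += A[i][k + 3] * B[k + 3][j];
                aux[j] += A[i][k + 4] * B[k + 4][j];
                if (k == M - U) R[i][j] = aux[j];   /* k == 10 in the example */
            }
}

/* Returns 1 if both versions compute the same matrix. */
int matmul_variants_agree(void)
{
    static int A[L][M], B[M][N], R1[L][N], R2[L][N];
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k++)
            A[i][k] = (i * 3 + k * 5) % 11 - 5;     /* deterministic test data */
    for (int k = 0; k < M; k++)
        for (int j = 0; j < N; j++)
            B[k][j] = (k * 7 + j * 2) % 13 - 6;
    matmul_ref(A, B, R1);
    matmul_unroll_and_jam(A, B, R2);
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++)
            if (R1[i][j] != R2[i][j])
                return 0;
    return 1;
}
```

The guard k == M - U marks the last trip of the unrolled k-loop, which is the point at which aux[j] holds the complete dot product and can be copied to R[i][j].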



Footnote 2: This is a very inaccurate estimation, since it neither estimates the resources spent by the controlling network,
which decreases the unroll factor, nor takes into account that, for example, the BREG-PAEs also have an adder, which
increases the unrolling degree. Although it has no influence on this example, the unrolling degree calculation of
course has to account for this in a production compiler.
5.6.5 Final Code

After allocation of the arrays and scalars to IRAMs, the code running on the RISC looks as follows.

Note that aux is nominally preloaded, although its value is not [...] loop.
Nevertheless it must be preloaded by the other iterations; therefore we
must issue an XppPreload, not an XppPreloadClean.

XppPreloadConfig(_XppCfg_matmult);

for(i=0; i<L; i++) {
    XppPreload(12, &aux, N);
    XppPreload(0, &A[i][0], M);
    XppPreload(1, &A[i][0], M);
    XppPreload(2, &A[i][0], M);
    XppPreload(3, &A[i][0], M);
    XppPreload(4, &A[i][0], M);
    XppPreloadClean(11, &R[i][0], N);
    for(k=0; k<M; k+=5) {
        XppPreload(5, &k, 1);
        XppPreload(6, &B[k][0], N);
        XppPreload(7, &B[k+1][0], N);
        XppPreload(8, &B[k+2][0], N);
        XppPreload(9, &B[k+3][0], N);
        XppPreload(10, &B[k+4][0], N);
        XppExecute();
    }
}



The configuration is shown below. 
void _XppCfg_matmult()
{
    // IRAMs
    // A[i][k]
    int iram0[128], iram1[128], iram2[128], iram3[128], iram4[128];
    // k
    int iram5[128];
    // B[k][j] .. B[k+4][j]
    int iram6[128], iram7[128], iram8[128], iram9[128], iram10[128];
    // R[i][j], aux[j]
    int iram11[128], iram12[128];

    for(j=0; j<N; j++) {
        tmp1 = iram0[iram5[0]]   * iram6[j];
        tmp2 = iram1[iram5[0]+1] * iram7[j];
        tmp3 = iram2[iram5[0]+2] * iram8[j];
        tmp4 = iram3[iram5[0]+3] * iram9[j];
        tmp5 = iram4[iram5[0]+4] * iram10[j];
        if (iram5[0] == 0)
            tmp6 = tmp1 + tmp2 + tmp3 + tmp4 + tmp5;
        else
            tmp6 = iram12[j] + tmp1 + tmp2 + tmp3 + tmp4 + tmp5;
        iram12[j] = tmp6;
        if (iram5[0] == 10)
            iram11[j] = tmp6;
    }
}



The estimated statistics are shown in the table below. Unfortunately the IRAM usage prevents a better 
utilization. Figure 64 shows the dataflow graph of the configuration. 
Parameter                Value
Vector length            20
Reused data set size     -
I/O IRAMs                11 I + 1 O + 1 I/O = 13
ALU                      10
BREG                     few
FREG                     few
Dataflow graph width     14
Dataflow graph height    6
Configuration cycles     6+20 = 26




[Figure: dataflow graph with inputs iram12, iram5[0], iram6[j] ... iram10[j] and iram0[iram5[0]] ... iram4[iram5[0]+4], and output iram11[j]]

Figure 64: Dataflow graph of matrix multiplication after unroll-and-jam. Counters and address calculations
are omitted.
5.6.6 Performance Evaluation

The next table lists the estimated performance of data transfers. 
[Table: estimated performance of the data transfers, one row per preload (rows of A, rows of B, aux, k and R). The column headings are not legible in the source. Legible values: 60, 2, 112, 4 for the first preload of a row of A and 60, 0, 4 for the following ones; 80, 3, 168, 5 for the preloads of rows of B; 96 for the last row. Legible annotations: "aux, stays in cache" and "k, written back in i loop".]



For comparison with the reference system, we assume that first the configuration, the first five
[...] are preloaded (row startup i-loop). [...] the nine subsequent iterations of the
i-loop, only the five [...] of A are preloaded (row steady i-loop). All loads of A[i][0] cause one cache miss and
[...].

Furthermore we assume that all values of B are loaded into the cache during execution of the first
[...] loop [...] during the other iterations. Thus cache misses due to
accesses to B are only taken into account three times (row j-loop i=0). All subsequent 27 * 5 accesses
to B cause only cache-IRAM transfers (row j-loop i!=0). We assume that aux stays in its IRAM or is
[...] in the cache during the whole execution. While the first assumption assumes that no
task switch occurs during calculation of the whole matrix - a fact that we cannot guarantee - the
[...] due to the dominance of the execution cycles neither has an
impact on the total performance.

[...] row WB R shows the write-backs of the result matrix R, which occur ten times
and are also added to the other terms.



The hand coded configuration cycles are measured to 55 XPP cycles, or 110 cache cycles.



[Table: performance summary with the rows startup i-loop, steady i-loop, j-loop i=0, j-loop i!=0, WB R and sum, and the column groups Data Access (RAM, DCache), Configuration (RAM, ICache), [...] (Cache, IRAM), [...] and Speedup. Most values are not legible in the source; the sum row begins with 4768.]





The final utilization is shown in the next table. 
Parameter                      Value
Vector length                  20
Reused data set size           -
I/O IRAMs [sum - pct]          13 - 82%
ALU [sum - pct]                13 - 20%
BREG [def/route/sum - pct]     10/27/37 - 46%
FREG [def/route/sum - pct]     17/9/28 - 35%
5.7 Viterbi Encoder



5.7.1 Original Code 

Source Code: 

/* C-language butterfly */
#define BFLY(i) {\
    unsigned char metric, m0, m1, decision;\
    metric = ((Branchtab29_1[i] ^ sym1) +\
              (Branchtab29_2[i] ^ sym2) + 1)/2;\
    m0 = vp->old_metrics[i] + metric;\
    m1 = vp->old_metrics[i+128] + (15 - metric);\
    decision = (m0-m1) >= 0;\
    vp->new_metrics[2*i] = decision ? m1 : m0;\
    vp->dp->w[i/16] |= decision << ((2*i)&31);\
    m0 -= (metric+metric-15);\
    m1 += (metric+metric-15);\
    decision = (m0-m1) >= 0;\
    vp->new_metrics[2*i+1] = decision ? m1 : m0;\
    vp->dp->w[i/16] |= decision << ((2*i+1)&31);\
}

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
    int i;
    struct v29 *vp = p;
    unsigned char *tmp;
    int normalize = 0;

    for(i=0; i<8; i++)
        vp->dp->w[i] = 0;

    for(i=0; i<128; i++)
        BFLY(i);

    /* Renormalize metrics */
    if (vp->new_metrics[0] > 150) {
        int i;
        unsigned char minmetric = 255;

        for(i=0; i<64; i++)
            if (vp->new_metrics[i] < minmetric)
                minmetric = vp->new_metrics[i];
        for(i=0; i<64; i++)
            vp->new_metrics[i] -= minmetric;
        normalize = minmetric;
    }

    vp->dp++;
    tmp = vp->old_metrics;
    vp->old_metrics = vp->new_metrics;
    vp->new_metrics = tmp;

    return normalize;
}
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5.7.2 Interprocedural Optimizations and Scalar Transformations 

Since no function call is candidate for inlining, no interprocedural code movement is done. 

After expression simplification, strength reduction, SSA renaming, copy coalescing and idiom
recognition, the code looks like below, where statements were reordered for convenience.
Note that idiom recognition will find the combination of min() and use of the comparison result for
decision and _decision. However the resulting computation cannot be expressed in C, so we describe it
as a comment:

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
  int i;
  struct v29 *vp = p;
  unsigned char *tmp;
  int normalize = 0;

  char *_vpdpw_ = vp->dp->w;
  for (i=0;i<8;i++)
    *_vpdpw_++ = 0;

  char *_bt29_1 = Branchtab29_1;
  char *_bt29_2 = Branchtab29_2;
  char *_vpom0 = vp->old_metrics;
  char *_vpom128 = vp->old_metrics+128;
  char *_vpnm = vp->new_metrics;
  char *_vpdpw = vp->dp->w;

  for (i=0;i<128;i++) {
    unsigned char metric, _tmp, m0, m1, _m0, _m1, decision, _decision;

    metric = ((*_bt29_1++ ^ sym1) +
              (*_bt29_2++ ^ sym2) + 1)/2;
    _tmp = (metric+metric-15);
    m0 = *_vpom0++ + metric;
    m1 = *_vpom128++ + (15 - metric);
    _m0 = m0 - _tmp;
    _m1 = m1 + _tmp;
    // decision = m0 >= m1;
    // _decision = _m0 >= _m1;
    *_vpnm++ = min(m0,m1);    // = decision ? m1 : m0
    *_vpnm++ = min(_m0,_m1);  // = _decision ? _m1 : _m0
    _vpdpw[i >> 4] |= ( m0 >=  m1) /* decision */  << ((2*i) & 31)
                    | (_m0 >= _m1) /* _decision */ << ((2*i+1) & 31);
  }

  /* Renormalize metrics */
  if (vp->new_metrics[0] > 150) {
    int i;
    unsigned char minmetric = 255;

    char *_vpnm = vp->new_metrics;
    for (i=0;i<64;i++)
      minmetric = min(minmetric, *_vpnm++);

    _vpnm = vp->new_metrics;
    for (i=0;i<64;i++)
      *_vpnm++ -= minmetric;
    normalize = minmetric;
  }

  vp->dp++;
  tmp = vp->old_metrics;
  vp->old_metrics = vp->new_metrics;
  vp->new_metrics = tmp;

  return normalize;
}
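The claim that the survivor selection "decision ? m1 : m0" is the same as min(m0,m1) can be checked exhaustively for 8-bit metrics. The following is only a sketch; the helper names min_u8, select_original and acs_forms_agree are hypothetical and do not appear in the original sources.

```c
/* Smaller of two 8-bit path metrics (hypothetical helper). */
unsigned char min_u8(unsigned char a, unsigned char b)
{
    return a < b ? a : b;
}

/* Survivor selection as written in the BFLY macro: the operands are
 * promoted to int, so (m0 - m1) >= 0 is the same test as m0 >= m1. */
unsigned char select_original(unsigned char m0, unsigned char m1)
{
    int decision = (m0 - m1) >= 0;
    return decision ? m1 : m0;
}

/* Returns 1 if both forms agree for every pair of 8-bit metrics. */
int acs_forms_agree(void)
{
    int m0, m1;
    for (m0 = 0; m0 < 256; m0++)
        for (m1 = 0; m1 < 256; m1++)
            if (select_original((unsigned char)m0, (unsigned char)m1) !=
                min_u8((unsigned char)m0, (unsigned char)m1))
                return 0;
    return 1;
}
```

Because both forms agree, the comparison result can feed the decision bit while the min() operation delivers the new metric, which is the combination the idiom recognition looks for.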



5.7.3 Initialization and Butterfly Loop 

The first two loops (in which the BFLY() macro has been inlined) are of interest for being executed
on the XPP array and need further examination. Below is the configuration source code of these
loops:

/** XppCfg_viterbi29()
 * Performs viterbi butterfly loop
 * XPPIN:  iram0,2 contains Branchtab29_1 and Branchtab29_2, respectively
 *         iram4,5 contains old_metrics and old_metrics+128, respectively
 *         iram1,3 contains scalars sym1 and sym2, respectively
 * XPPOUT: iram6 contains the new metrics array
 *         iram7 contains the decision array
 */
void XppCfg_viterbi29()
{
  // IRAMs in FIFO mode
  //
  char *iram0;  // Branchtab29_1, read access with 32-to-8-bit converter
  char *iram2;  // Branchtab29_2, read access with 32-to-8-bit converter
  char *iram4;  // vp->old_metrics, read access with 32-to-8-bit converter
  char *iram5;  // vp->old_metrics+128, read access with 32-to-8-bit converter
  short *iram6; // vp->new_metrics, write access with 16-to-32-bit converter

  // IRAMs in RAM mode
  //
  int iram1[128]; // sym1, read access
  int iram3[128]; // sym2, read access
  int iram7[128]; // vp->dp->w, write access

  int i;

  unsigned char sym1, sym2;

  sym1 = iram1[0];
  sym2 = iram3[0];

  for (i=0;i<8;i++)
    iram7[i] = 0;

  for (i=0;i<128;i++) {
    unsigned char metric, _tmp, m0, m1, _m0, _m1;

    metric = ((*iram0++ ^ sym1) +
              (*iram2++ ^ sym2) + 1)/2;
    _tmp = (metric << 1) - 15;
    m0 = *iram4++ + metric;
    m1 = *iram5++ + (15 - metric);
    _m0 = m0 - _tmp;
    _m1 = m1 + _tmp;
    // assuming big endian; little endian has the shift on the latter
    *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
    iram7[i >> 4] |= ( m0 >=  m1) << ((2*i) & 31)
                   | (_m0 >= _m1) << ((2*i+1) & 31);
  }
}

The dataflow graph is as follows (the 32-to-8-bit converters are not shown). The solid lines represent
flow of data, while the dashed lines represent flow of events:

[Figure: dataflow graph of the butterfly loop. Inputs: Btab29_1 (iram0), sym1 (iram1), Btab29_2
(iram2), sym2 (iram3), oldmetrics (iram4), oldmetrics+128 (iram5) and the counter cnt (2i).
Outputs: newmetrics (iram6) and vp->dp->w (iram7).]



The recurrence on the IRAM7 access needs at least 2 cycles, i.e. 2 cycles are needed for each input
value. Therefore a total of 256 cycles is needed for a vector length of 128.

Parameter                 Value
Vector length             read: 32 (= 128 chars), write: 64 (= 256 chars)
Reused data set size      -
I/O IRAMs                 6 I + 2 O
ALU                       26
BREG                      few
FREG                      few
Dataflow graph width      4
Dataflow graph height     12 + 4 (32-to-8-bit converters)
Configuration cycles      16 + 256



[...] once, and accumulate the bits in a temporary variable, writing the value to the IRAM at the end.
Loop tiling with a tile size of 16 also [...].

All remaining input IRAMs are character (8-bit) based. Therefore [...].

The resulting configuration source code is listed below; splitting has been omitted for the sake of
simplicity:

/** _XppCfg_viterbi29()
 * Performs viterbi butterfly loop
 * XPPIN:  iram0,2 contains Branchtab29_1 and Branchtab29_2, respectively
 *         iram4,5 contains old_metrics and old_metrics+128, respectively
 *         iram1,3 contains scalars sym1 and sym2, respectively
 * XPPOUT: iram6 contains the new metrics array
 *         iram7 contains the decision array
 */
void XppCfg_viterbi29()
{
  // IRAMs in FIFO mode
  //
  char *iram0;  // Branchtab29_1, read access with 32-to-8-bit converter
  char *iram2;  // Branchtab29_2, read access with 32-to-8-bit converter
  char *iram4;  // vp->old_metrics, read access with 32-to-8-bit converter
  char *iram5;  // vp->old_metrics+128, read access with 32-to-8-bit converter
  short *iram6; // vp->new_metrics, write access with 16-to-32-bit converter
  unsigned long *iram7; // vp->dp->w, write access

  // IRAMs in RAM mode
  //
  int iram1[128]; // sym1, read access
  int iram3[128]; // sym2, read access

  int i, i2;
  int rise;

  unsigned char sym1, sym2;

  sym1 = iram1[0];
  sym2 = iram3[0];

  for (i=0;i<8;i++) {
    rise = 0;
    for (i2=0;i2<32;i2+=2) { // unrolled once
      unsigned char metric, _tmp, m0, m1, _m0, _m1;

      metric = ((*iram0++ ^ sym1) +
                (*iram2++ ^ sym2) + 1)/2;
      _tmp = (metric << 1) - 15;
      m0 = *iram4++ + metric;
      m1 = *iram5++ + (15 - metric);
      _m0 = m0 - _tmp;
      _m1 = m1 + _tmp;

      *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
      rise = rise | ( m0 >=  m1) << i2
                  | (_m0 >= _m1) << (i2+1);
    }
    *iram7++ = rise;
  }
}
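The effect of tiling the butterfly loop and collecting the decision bits in a scalar can be illustrated in plain C. The sketch below is an assumption-laden illustration, not the configuration itself: the decision bits are supplied as arrays d0 and d1, and the function names pack_rmw, pack_tiled and pack_variants_agree are hypothetical.

```c
#include <string.h>

#define NBUTTERFLY 128   /* butterflies per symbol, as in the text        */
#define NWORDS     8     /* 256 decision bits stored in eight 32-bit words */

/* Variant 1: read-modify-write on the decision array for every butterfly. */
void pack_rmw(const unsigned char *d0, const unsigned char *d1,
              unsigned long *w)
{
    int i;
    memset(w, 0, NWORDS * sizeof *w);
    for (i = 0; i < NBUTTERFLY; i++)
        w[i >> 4] |= ((unsigned long)d0[i] << ((2 * i) & 31))
                   | ((unsigned long)d1[i] << ((2 * i + 1) & 31));
}

/* Variant 2: tiled loop; the bits of one word are collected in a scalar
 * and each word is stored exactly once. */
void pack_tiled(const unsigned char *d0, const unsigned char *d1,
                unsigned long *w)
{
    int i, i2, n = 0;
    for (i = 0; i < NWORDS; i++) {
        unsigned long acc = 0;
        for (i2 = 0; i2 < 32; i2 += 2, n++)
            acc |= ((unsigned long)d0[n] << i2)
                 | ((unsigned long)d1[n] << (i2 + 1));
        w[i] = acc;
    }
}

/* Returns 1 if both variants produce identical decision words. */
int pack_variants_agree(void)
{
    unsigned char d0[NBUTTERFLY], d1[NBUTTERFLY];
    unsigned long a[NWORDS], b[NWORDS];
    int i;
    for (i = 0; i < NBUTTERFLY; i++) {
        d0[i] = (unsigned char)((i * 7 + 3) & 1);  /* arbitrary test bits */
        d1[i] = (unsigned char)((i / 3) & 1);
    }
    pack_rmw(d0, d1, a);
    pack_tiled(d0, d1, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

In the second variant the only loop-carried value is the scalar accumulator, so the memory word is no longer part of the recurrence.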




The modified dataflow graph, where unrolling and splitting have been omitted for simplicity: 



[Figure: modified dataflow graph. Inputs: Btab29_1 (iram0), sym1 (iram1), Btab29_2 (iram2), sym2
(iram3), oldmetrics (iram4) and oldmetrics+128 (iram5); two counters, cnt (i2 = 2*i) and
cnt (_i = [0..7]).]

Again, the recurrence with the rise scalar needs two cycles. With an unrolling factor of two, 128 cycles
are needed for a vector length of 128.

Parameter                 Value
Vector length             32 (read) / 64 (write)
Reused data set size      -
I/O IRAMs                 6 I + 2 O
ALU                       2*26+2 (pin) = 62
BREG                      few
FREG                      few
Dataflow graph width      4
Dataflow graph height     12 + 4 (32-to-8-bit converters) = 16
Configuration cycles      16 + 128
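The cycle counts quoted for the two butterfly configurations follow a simple pattern: a pipeline fill term plus the recurrence length multiplied by the number of loop iterations. The sketch below merely restates that arithmetic; config_cycles and its parameter names are hypothetical, and the model is an assumption distilled from the tables, not a formula given in the source.

```c
/* Estimated configuration cycles: pipeline fill plus `recurrence` cycles
 * for each of the vector_length / unroll loop iterations. */
unsigned config_cycles(unsigned fill, unsigned recurrence,
                       unsigned vector_length, unsigned unroll)
{
    return fill + recurrence * (vector_length / unroll);
}
```

With a fill of 16 and a two-cycle recurrence, config_cycles(16, 2, 128, 1) gives 16+256 and config_cycles(16, 2, 128, 2) gives 16+128, matching the two tables.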



5.7.4 Re-Normalization

The normalization consists of a loop scanning the input for the minimum and a second loop that subtracts
the minimum from all elements. There is a data dependence between all iterations of the first loop and
all iterations of the second loop. Therefore the two loops cannot be merged; they are handled
individually below.

Minimum Search

The third loop is a minimum search on a vector of bytes. The first version of the configuration source
code is listed below:

/** XppCfg_calcmin()
 * Performs a minimum search over a character array
 * XPPIN:  iram6 contains the character input array
 * XPPOUT: iram0 contains the minimum value
 */
void XppCfg_calcmin()
{
  // IRAMs in FIFO mode
  //
  unsigned char *iram6; // vp->new_metrics, read access with 32-to-8-bit converter

  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, write access
  int i;

  unsigned char minmetric = 255;

  for (i=0;i<64;i++) {
    minmetric = min(minmetric, *iram6++);
  }
  iram0[0] = minmetric;
}

As there is a recurrence with minmetric which needs two cycles, a total of 128 cycles is needed for a
vector length of 64.

Parameter                 Value
Vector length             16 (= 64 chars)
Reused data set size      -
I/O IRAMs                 1 I + 1 O
ALU                       2
BREG                      2
FREG                      3
Dataflow graph width      1
Dataflow graph height     1 + 4 (32-to-8-bit converter)
Configuration cycles      5 + 128

Reduction recognition eliminates the dependence on minmetric, enabling loop unrolling with an
unrolling factor of 4 to utilize the IRAM width of 32 bits. A split network has to be added to separate
the 8-bit streams using 3 SHIFT and 3 AND operations. Tree balancing redistributes the min()
operations to minimize the tree height.

/** XppCfg_calcmin()
 * Performs a minimum search over a character array
 * XPPIN:  iram6 contains the character input array
 * XPPOUT: iram0 contains the minimum value
 */
void XppCfg_calcmin()
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access

  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, write access
  int i;

  unsigned char minmetric = 255;

  for (i=0;i<16;i++) {
    unsigned long val;

    val = *iram6++;
    minmetric = min(minmetric, min( min(val & 0xff, (val >> 8) & 0xff),
                                    min((val >> 16) & 0xff, val >> 24) ));
  }
  iram0[0] = (long)minmetric;
}

Parameter                 Value
Vector length             16
Reused data set size      -
I/O IRAMs                 1 I + 1 O
ALU                       8
BREG                      5
FREG                      3
Dataflow graph width      4
Dataflow graph height     5
Configuration cycles      5 + 32



The recurrence of two cycles makes it profitable to double the loop body. Reduction recognition again
eliminates the loop-carried dependence on minmetric, enabling loop tiling and then unroll-and-jam to
increase parallelism. Constant propagation and tree rebalancing reduce the dependence height of the
final merging expression. The final configuration source code is listed below:



/** XppCfg_calcmin()
 * Performs a minimum search over a character array
 * XPPIN:  iram6 contains the character input array
 * XPPOUT: iram0 contains the minimum value
 */
void XppCfg_calcmin()
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access

  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, write access
  int i;

  unsigned char minmetric0 = 255, minmetric1 = 255;

  for (i=0;i<8;i++) {
    unsigned long val;

    val = *iram6++;
    minmetric0 = min(minmetric0, min( min(val & 0xff, (val >> 8) & 0xff),
                                      min((val >> 16) & 0xff, val >> 24) ));
    val = *iram6++;
    minmetric1 = min(minmetric1, min( min(val & 0xff, (val >> 8) & 0xff),
                                      min((val >> 16) & 0xff, val >> 24) ));
  }
  iram0[0] = (long)min(minmetric0, minmetric1);
}

Parameter                 Value
Vector length             16
Reused data set size      -
I/O IRAMs                 1 I + 1 O
ALU                       16
BREG                      10
FREG                      0
Dataflow graph width      2*4 = 8
Dataflow graph height     5
Configuration cycles      5 + 16
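The packed minimum search with two independent accumulators can be exercised on ordinary hardware. The following is a minimal sketch under the assumption that four 8-bit metrics are packed into the low 32 bits of each word; min_ul, min_of_word, packed_min and packed_min_demo are hypothetical names chosen for illustration.

```c
/* Smaller of two unsigned values (hypothetical helper). */
unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Minimum of the four bytes packed in one 32-bit word, computed as a
 * balanced tree of three min operations. */
unsigned long min_of_word(unsigned long val)
{
    return min_ul(min_ul(val & 0xff, (val >> 8) & 0xff),
                  min_ul((val >> 16) & 0xff, (val >> 24) & 0xff));
}

/* Minimum search over nwords packed words using two accumulators that
 * do not depend on each other; nwords is assumed to be even. */
unsigned char packed_min(const unsigned long *words, int nwords)
{
    unsigned long min0 = 255, min1 = 255;
    int i;
    for (i = 0; i < nwords; i += 2) {
        min0 = min_ul(min0, min_of_word(words[i]));
        min1 = min_ul(min1, min_of_word(words[i + 1]));
    }
    return (unsigned char)min_ul(min0, min1);
}

/* Small usage example: the smallest byte in the four words is 0x05. */
int packed_min_demo(void)
{
    const unsigned long words[4] = {
        0x10203040UL, 0x0a0b0c0dUL, 0xff05ffffUL, 0x11111111UL
    };
    return packed_min(words, 4);
}
```

Each accumulator only sees every second word, so the two recurrences can overlap; the partial minima are merged once after the loop.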



Re-Normalization

The fourth loop subtracts the minimum of the third loop from each element in the array. The
read-modify-write operation has to be broken up into two IRAMs; otherwise [...].

/** XppCfg_subtract()
 * Subtracts a scalar from a character array
 * XPPIN:  iram6 contains the character input array
 *         iram0 contains the scalar which is subtracted
 * XPPOUT: iram1 contains the result array
 */
void XppCfg_subtract()
{
  // IRAMs in FIFO mode
  //
  unsigned char *iram6; // vp->new_metrics, read access with 32-to-8-bit converter
  unsigned char *iram1; // vp->new_metrics, write access with 8-to-32-bit converter

  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, read access
  int i;

  unsigned char minmetric = iram0[0];

  for (i=0;i<64;i++) {
    *iram1++ = *iram6++ - minmetric;
  }
}




Parameter                 Value
Vector length             16 (= 64 chars)
Reused data set size      -
I/O IRAMs                 2 I + 1 O
ALU                       1 + 2 (converters)
BREG                      2 (converters)
FREG                      2 (converters)
Dataflow graph width      1
Dataflow graph height     1 + 8 (converters)
Configuration cycles      9 + 64



There is no loop-carried dependence. Since the size of the data is 8 bits, the inner loop can be unrolled
four times without exceeding the IRAM bandwidth requirements. Networks splitting the 32-bit stream
into four 8-bit streams, and re-joining the individual results to a common 32-bit result stream, are
inserted. The final configuration source code is listed below:

/** XppCfg_subtract()
 * Subtracts a scalar from a character array
 * XPPIN:  iram6 contains the character input array
 *         iram0 contains the scalar which is subtracted
 * XPPOUT: iram1 contains the result array
 */
void XppCfg_subtract()
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access
  int *iram1; // vp->new_metrics, write access

  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, read access
  int i;

  unsigned char minmetric = iram0[0];

  for (i=0;i<16;i++) {
    unsigned long val;
    unsigned char r0, r1, r2, r3;

    val = *iram6++;
    r0 = (val & 0xff) - minmetric;
    r1 = ((val >> 8) & 0xff) - minmetric;
    r2 = ((val >> 16) & 0xff) - minmetric;
    r3 = (val >> 24) - minmetric;
    *iram1++ = (r3 << 24) | (r2 << 16) | (r1 << 8) | r0;
  }
}

Parameter                 Value
Vector length             16
Reused data set size      -
I/O IRAMs                 2 I + 1 O
ALU                       11
BREG                      6
FREG                      0
Dataflow graph width      4
Dataflow graph height     5
Configuration cycles      5 + 16 = 21
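The split, subtract and re-join network can be written as a small C function that handles one 32-bit word at a time. This is only an illustrative sketch: packed_subtract is a hypothetical name, and the casts to unsigned long are added so that the shifts are well defined on any host.

```c
/* Subtracts m from each of the four bytes packed in the low 32 bits of
 * val. Each lane is computed modulo 256 and the lanes are re-joined. */
unsigned long packed_subtract(unsigned long val, unsigned char m)
{
    unsigned char r0 = (unsigned char)((val & 0xff) - m);
    unsigned char r1 = (unsigned char)(((val >> 8) & 0xff) - m);
    unsigned char r2 = (unsigned char)(((val >> 16) & 0xff) - m);
    unsigned char r3 = (unsigned char)(((val >> 24) & 0xff) - m);

    return ((unsigned long)r3 << 24) | ((unsigned long)r2 << 16)
         | ((unsigned long)r1 << 8) | (unsigned long)r0;
}
```

For example, packed_subtract(0x40302010UL, 0x10) yields 0x30201000. In the re-normalization the subtracted value is the minimum of all metrics, so no lane underflows.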



5.7.5 Final Code 



The code executed on the RISC is listed below. It starts the configurations: 

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
  struct v29 *vp = p;
  unsigned char *tmp;
  int normalize = 0;

  long _sym1 = sym1;
  long _sym2 = sym2;

  XppPreloadConfig(_XppCfg_viterbi29);
  XppPreload(0, Branchtab29_1, 32);
  XppPreload(2, Branchtab29_2, 32);
  XppPreload(4, vp->old_metrics, 32);
  XppPreload(5, vp->old_metrics+128, 32);
  XppPreload(1, &_sym1, 1);
  XppPreload(3, &_sym2, 1);
  XppPreloadClean(6, vp->new_metrics, 64);
  XppPreloadClean(7, vp->dp->w, 8);
  XppExecute();

  /* Renormalize metrics */
  if (vp->new_metrics[0] > 150) {
    long minmetric;

    XppPreloadConfig(_XppCfg_calcmin);
    XppPreloadClean(0, &minmetric, 1);
    XppExecute();

    XppPreloadConfig(_XppCfg_subtract);
    XppPreloadClean(5, vp->new_metrics, 16);
    XppExecute();

    XppSync(&minmetric, 1);
    normalize = minmetric;
  }

  XppSync(vp->new_metrics, 64);
  vp->dp++;
  tmp = vp->old_metrics;
  vp->old_metrics = vp->new_metrics;
  vp->new_metrics = tmp;

  return normalize;
}

The three configurations are shown in the following: 



/** XppCfg_viterbi29()
 * Performs viterbi butterfly loop
 * XPPIN:  iram0,2 contains Branchtab29_1 and Branchtab29_2, respectively
 *         iram4,5 contains old_metrics and old_metrics+128, respectively
 *         iram1,3 contains scalars sym1 and sym2, respectively
 * XPPOUT: iram6 contains the new metrics array
 *         iram7 contains the decision array
 */
void XppCfg_viterbi29()
{
  // IRAMs in FIFO mode
  //
  char *iram0;  // Branchtab29_1, read access with 32-to-8-bit converter
  char *iram2;  // Branchtab29_2, read access with 32-to-8-bit converter
  char *iram4;  // vp->old_metrics, read access with 32-to-8-bit converter
  char *iram5;  // vp->old_metrics+128, read access with 32-to-8-bit converter
  short *iram6; // vp->new_metrics, write access with 16-to-32-bit converter
  unsigned long *iram7; // vp->dp->w, write access

  // IRAMs in RAM mode
  //
  int iram1[128]; // sym1, read access
  int iram3[128]; // sym2, read access

  int i, i2;
  int rise;

  unsigned char sym1, sym2;

  sym1 = iram1[0];
  sym2 = iram3[0];

  for (i=0;i<8;i++) {
    rise = 0;
    for (i2=0;i2<32;i2+=2) { // unrolled once
      unsigned char metric, _tmp, m0, m1, _m0, _m1;

      metric = ((*iram0++ ^ sym1) +
                (*iram2++ ^ sym2) + 1)/2;
      _tmp = (metric << 1) - 15;
      m0 = *iram4++ + metric;
      m1 = *iram5++ + (15 - metric);
      _m0 = m0 - _tmp;
      _m1 = m1 + _tmp;

      *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
      rise = rise | ( m0 >=  m1) << i2
                  | (_m0 >= _m1) << (i2+1);
    }
    *iram7++ = rise;
  }
}

/** XppCfg_calcmin()
 * Performs a minimum search over a character array
 * XPPIN:  iram6 contains the character input array
 * XPPOUT: iram0 contains the minimum value
 */
void XppCfg_calcmin()
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access

  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, write access
  int i;

  unsigned char minmetric0 = 255, minmetric1 = 255;

  for (i=0;i<8;i++) {
    unsigned long val;

    val = *iram6++;
    minmetric0 = min(minmetric0, min( min(val & 0xff, (val >> 8) & 0xff),
                                      min((val >> 16) & 0xff, val >> 24) ));
    val = *iram6++;
    minmetric1 = min(minmetric1, min( min(val & 0xff, (val >> 8) & 0xff),
                                      min((val >> 16) & 0xff, val >> 24) ));
  }
  iram0[0] = (long)min(minmetric0, minmetric1);
}

/** XppCfg_subtract()
 * Subtracts a scalar from a character array
 * XPPIN:  iram6 contains the character input array
 *         iram0 contains the scalar which is subtracted
 * XPPOUT: iram1 contains the result array
 */
void _XppCfg_subtract()
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access
  int *iram1; // vp->new_metrics, write access

  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, read access
  int i;

  unsigned char minmetric = iram0[0];

  for (i=0;i<16;i++) {
    unsigned long val;
    unsigned char r0, r1, r2, r3;

    val = *iram6++;
    r0 = (val & 0xff) - minmetric;
    r1 = ((val >> 8) & 0xff) - minmetric;
    r2 = ((val >> 16) & 0xff) - minmetric;
    r3 = (val >> 24) - minmetric;
    *iram1++ = (r3 << 24) | (r2 << 16) | (r1 << 8) | r0;
  }
}



5.7.6 Performance Evaluation 

The following table assumes that the if condition in the source code is true, i.e. that the metrics are
renormalized.

configurations      preloads               write-backs        Data RAM   DCache

viterbi29           Branchtab29_1          vp->new_metrics    1352       52
                    Branchtab29_2          vp->dp->w
                    vp->old_metrics
                    vp->old_metrics+128
                    sym1
                    sym2

calcmin             vp->new_metrics        minmetric          536        17

subtract            vp->new_metrics        vp->new_metrics    760        33
                    minmetric

all configurations  Branchtab29_1          vp->dp->w          1440       53
                    Branchtab29_2          minmetric
                    vp->old_metrics        vp->new_metrics
                    vp->old_metrics+128
                    sym1
                    sym2



In the following tables the performance is compared to the reference system 

of taking the maximum (col™ JffPtaSS^^^^^ 1 ^ CydeS ,nstead 



configurations 










^&Cachei.i^' 


vitenji29 (unrolling) 
viterbi29 (no unrolling) 
calcmin 
subtract 

all ergs (unrolling) 


1362 ■ — 52 
1352 52 
536 17 
760 33 


yfa-aa 1377 

5432 770 
3024 429 
1736 245 
14392 2051 
10136 1444 


if ^8 1410j"c8if42 

p^f 2602^|38it 
*g20j 2217|^7^I 


" 3624.;4976 
3624.^4976 
256^ 
192>g952 


2,6pol 


all cfgs (no unrolling) 


I44U 53 
1440 53 


4072t55i2 
4072^12 


■rite 1.6 -013! 
1.8 



configurations 
"viierDi29 (unrolling) 


RAM~£5ciacne 


^nTigtSaTiQrr 
RAM'lbacHi 


•(^r^c^criefek^ 


Tpfr^stemT 
l&cnef {JAM 




viterbi29 (no unrolling) 

calcmin 

subtract 

all ergs (unrolling) 


1352 32 

1352 52 
536 17 
760 33 




itm 588' 4352 
f^SBJ 56&S3§ 
•"^381 76%Z60 


36241 4576 
3624l;49}6 
256'r.7J92 


v-6;2 e.2'^'3,71 
#,| '4,6|>ri 3 S 
'■'•■98 9 S 


all ergs (no unrolling) 


1440 §3 

1440 53 




72 ° pto 


40^5513 

4072[^§|| 


5.7 


?-.;3,8| 
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For viterbi a significant performance improvement up to a factor of 8.2 can be achieved using the XPP 
compared to the reference system. 

The final utilization is shown in the following tables. The information is taken from the '.info' files
generated from the NML source code by the XMAP tool.

Utilization of the viterbi29 configuration with unrolling (vit.nml):

Parameter                     Value
Vector length                 read: 32, write: 64
Reused data set size          -
I/O IRAMs [sum - pct]         8 - 50%
ALU [sum - pct]               47 - 73%
BREG [def/route/sum - pct]    27/37/64 - 80%
FREG [def/route/sum - pct]    24/27/51 - 64%


Utilization of the viterbi29 configuration without unrolling (vit_nounroll.nml):

Parameter                     Value
Vector length                 read: 32, write: 64
Reused data set size          -
I/O IRAMs [sum - pct]         8 - 50%
ALU [sum - pct]               25 - 39%
BREG [def/route/sum - pct]    18/23/41 - 51%
FREG [def/route/sum - pct]    18/11/29 - 36%


Utilization of the calcmin configuration (min.nml):

Parameter                     Value
Vector length                 16
Reused data set size          -
I/O IRAMs [sum - pct]         2 - 13%
ALU [sum - pct]               19 - 30%
BREG [def/route/sum - pct]    14/16/30 - 38%
FREG [def/route/sum - pct]    7/6/13 - 16%



Utilization of the subtract configuration (sub.nml):

Parameter                     Value
Vector length                 16
Reused data set size          -
I/O IRAMs [sum - pct]         3 - 19%
ALU [sum - pct]               11 - 17%
BREG [def/route/sum - pct]    6/10/16 - 20%
FREG [def/route/sum - pct]    2/9/11 - 14%

5.8 MPEG2 Codec - Quantization

The quantization file contains routines for quantization and inverse quantization of 8x8 macro blocks.
These functions differ for intra and non-intra blocks, and furthermore the encoder distinguishes
between MPEG1 and MPEG2 inverse quantization.

Since all functions have the same layout, i.e. some checks and one main loop running over the macro
block and quantizing with a quantization matrix, we concentrate on iquant_intra, the inverse
quantization of intra-blocks, since it contains all elements found in the other procedures. The
non-intra quantization loop bodies are more complicated, but add no compiler complexity. In the source
code the MPEG1 part is already inlined, which is straightforward since the function is statically
defined and contains no function calls itself. Therefore the compiler inlines it, and dead function
elimination removes the whole definition.

5.8.1 Original Code 

void iquant_intra(src, dst, dc_prec, quant_mat, mquant)
short *src, *dst;
int dc_prec;
unsigned char *quant_mat;
int mquant;
{
  int i, val, sum;

  if (mpeg1) {
    dst[0] = src[0] << (3-dc_prec);
    for (i=1; i<64; i++)
    {
      val = (int)(src[i]*quant_mat[i]*mquant)/16;

      /* mismatch control */
      if ((val&1)==0 && val!=0)
        val+= (val>0) ? -1 : 1;

      /* saturation */
      dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
    }
  }
  else
  {
    sum = dst[0] = src[0] << (3-dc_prec);
    for (i=1; i<64; i++)
    {
      val = (int)(src[i]*quant_mat[i]*mquant)/16;
      sum+= dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
    }

    /* mismatch control */
    if ((sum&1)==0)
      dst[63] ^= 1;
  }
}

In the following subsections we concentrate on the MPEG2 part.
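For experimentation, the MPEG2 branch can be isolated into a standalone ANSI C function. The sketch below is an assumption-laden rewrite rather than the encoder's code: the names iquant_intra_mpeg2 and iquant_demo are hypothetical, and the DC shift is written as a multiplication so that negative DC values do not rely on left-shifting a negative number.

```c
/* MPEG2 branch of the inverse intra quantization for one 8x8 block. */
void iquant_intra_mpeg2(const short *src, short *dst, int dc_prec,
                        const unsigned char *quant_mat, int mquant)
{
    int i, val, sum;

    /* DC coefficient: scaled by 2^(3 - dc_prec). */
    sum = dst[0] = (short)(src[0] * (1 << (3 - dc_prec)));

    for (i = 1; i < 64; i++) {
        val = (int)(src[i] * quant_mat[i] * mquant) / 16;
        /* saturation to the 12-bit signed range */
        sum += dst[i] = (short)((val > 2047) ? 2047
                               : ((val < -2048) ? -2048 : val));
    }

    /* mismatch control: make the coefficient sum odd */
    if ((sum & 1) == 0)
        dst[63] ^= 1;
}

/* Usage example with a flat quantization matrix of 8 and mquant = 2,
 * so that each AC value is passed through unchanged before saturation.
 * Returns the reconstructed coefficient at position index. */
int iquant_demo(int index)
{
    short src[64] = {0}, dst[64];
    unsigned char quant_mat[64];
    int i;

    for (i = 0; i < 64; i++)
        quant_mat[i] = 8;
    src[0] = 1;       /* DC: 1 * 8 = 8              */
    src[1] = 16;      /* stays 16                   */
    src[2] = 3000;    /* saturates to 2047          */
    src[3] = 1;       /* stays 1                    */

    iquant_intra_mpeg2(src, dst, 0, quant_mat, 2);
    return dst[index];
}
```

In this example the coefficient sum 8 + 16 + 2047 + 1 is even, so mismatch control toggles the lowest bit of the last coefficient.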




5.8.2 Preliminary Transformations 

Interprocedural Optimizations 

Analyzing the loop bodies shows that they easily fit on the XPP array and do not use the maximum of
resources by far. The function is called three times from another module. With inter-module function
inlining the code for the function call disappears and is replaced with the function. Therefore it reads:

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA)
    for (j=0; j<block_count; j++)
      if (mpeg1) {
        /* omitted */
      } else {
        sum = dst[0] = src[0] << (3-dc_prec);
        for (i=1; i<64; i++)
        {
          val = (int)(src[i]*quant_mat[i]*mquant)/16;
          sum+= dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
        }
        /* mismatch control */
        if ((sum&1)==0)
          dst[63] ^= 1;
      }
  else
    /* non intra block part omitted */
}

Basic Transformations 

The following transformations are done: 

- A peephole optimization reduces the division by 16 to a right shift by 4. This is essential since we
do not consider loop bodies containing division for the XPP.

- Idiom recognition reduces the statement after the comment /* saturation */ to
dst[i] = min(max(val, -2048), 2047).

- Since the global variable mpeg1 does not change within the loop, loop unswitching moves the
control statement outside the j-loop and produces two loop nests.

- Partial redundancy elimination inserts temporaries which store intermediate results.

- Reads from arrays are stored in temporaries and moved as early as possible.

- Writes to arrays are moved as late as possible.
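The first two rewrites in the list above can be checked with a few lines of C. This is a sketch only; imin, imax, sat_ternary, sat_minmax and rewrites_agree are hypothetical helper names, and the division check is deliberately restricted to non-negative values.

```c
int imin(int a, int b) { return a < b ? a : b; }
int imax(int a, int b) { return a > b ? a : b; }

/* Saturation as written in the original source. */
int sat_ternary(int val)
{
    return (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);
}

/* Saturation after idiom recognition. */
int sat_minmax(int val)
{
    return imin(imax(val, -2048), 2047);
}

/* Returns 1 if both rewrites preserve the result on the tested ranges. */
int rewrites_agree(void)
{
    int v;

    for (v = -5000; v <= 5000; v++)
        if (sat_ternary(v) != sat_minmax(v))
            return 0;

    /* Division by 16 versus shift by 4, checked for non-negative values
     * only: for negative operands C division truncates toward zero,
     * while the result of >> is implementation-defined. */
    for (v = 0; v <= 100000; v++)
        if (v / 16 != (v >> 4))
            return 0;

    return 1;
}
```

The saturation idiom maps directly onto one max and one min operation, which is why it is preferable to two nested conditional selections in a dataflow pipeline.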

The code below shows the result of these transformations. The MPEG1 part is again omitted, but looks
similar.

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA)
    if (mpeg1)
      /* omitted */
    else
      for (j=0; j<block_count; j++) {
        block_data = blocks[k*block_count+j][0];
        tmp1 = block_data << (3-dc_prec);
        sum = tmp1;
        blocks[k*block_count+j][0] = tmp1;
        for (i=1; i<64; i++) {
          block_data = blocks[k*block_count+j][i];
          mat_data = intra_q[i];
          val = (int)(block_data * mat_data * mquant)>>4;
          tmp2 = min(2047, max(-2048, val));
          sum += tmp2;
          blocks[k*block_count+j][i] = tmp2;
        }
        /* mismatch control */
        block_data = blocks[k*block_count+j][63];
        if ((sum&1)==0) {
          block_data ^= 1;
        }
        blocks[k*block_count+j][63] = block_data;
      }
}

The inner loop body uses only few resources; therefore we want to increase the size of the loop body.
Before we increase parallelism, the next subsection shows an optimization which transforms the loop
nest into a perfect loop nest.

Inverse Loop-Invariant Code Motion 

The loop-invariant statements surrounding the loop body are candidates for inverse loop-invariant code
motion. By moving them into the loop body and guarding them properly, the loop nest becomes perfect and
the utilization of the innermost loop increases. Since this optimization is the reverse of
loop-invariant code motion [...].

This time we only show the two innermost loop nests.

for (j=0; j<block_count; j++) {
  for (i=0; i<64; i++) {
    block_data = blocks[k*block_count+j][i];
    mat_data = intra_q[i];
    sol_0 = block_data << (3-dc_prec);
    sol_1_63 = (int)(block_data * mat_data * mquant)>>4;
    sat_1_63 = min(2047, max(-2048, sol_1_63));
    guard1 = (i==0);
    guard2 = (i==63);
    if (guard1)
      sol = sol_0;
    else
      sol = sat_1_63;
    if (guard1)
      sum = sol;
    else
      sum += sol;
    guard3 = ((sum & 1) == 0);
    if (guard2 && guard3)
      sol ^= 1;
    blocks[k*block_count+j][i] = sol;
  }
}
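That the guarded, perfect loop computes the same result as the original loop with its surrounding statements can be checked on test data. The following sketch simplifies the quantization to a plain saturation and uses hypothetical names (saturate, imperfect_nest, perfect_nest, nests_agree); it only illustrates the guarding scheme.

```c
#define NCOEF 64

int saturate(int v)
{
    return v > 2047 ? 2047 : (v < -2048 ? -2048 : v);
}

/* Original shape: statements before and after the loop. */
void imperfect_nest(const int *in, int *out, int scale)
{
    int i, sum;

    out[0] = in[0] * scale;          /* first element handled separately */
    sum = out[0];
    for (i = 1; i < NCOEF; i++) {
        out[i] = saturate(in[i]);
        sum += out[i];
    }
    if ((sum & 1) == 0)              /* mismatch control after the loop  */
        out[NCOEF - 1] ^= 1;
}

/* After inverse loop-invariant code motion: everything is inside the
 * loop, selected by guards on the iteration counter. */
void perfect_nest(const int *in, int *out, int scale)
{
    int i, sum = 0;

    for (i = 0; i < NCOEF; i++) {
        int guard1 = (i == 0);
        int guard2 = (i == NCOEF - 1);
        int sol = guard1 ? in[i] * scale : saturate(in[i]);

        if (guard1)
            sum = sol;
        else
            sum += sol;
        if (guard2 && (sum & 1) == 0)
            sol ^= 1;
        out[i] = sol;
    }
}

/* Returns 1 if both shapes agree; extra shifts one input value so that
 * both parities of the sum can be exercised. */
int nests_agree(int extra)
{
    int in[NCOEF], a[NCOEF], b[NCOEF], i;

    for (i = 0; i < NCOEF; i++)
        in[i] = i * 97 - 3000;
    in[40] += extra;

    imperfect_nest(in, a, 8);
    perfect_nest(in, b, 8);
    for (i = 0; i < NCOEF; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The guards cost a few comparisons per iteration, but the whole nest can then be mapped as a single pipelined configuration.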

The following table shows the estimated utilization and performance of a configuration synthesized
from the inner loop. The values show that there are many resources left for further optimizations.

Parameter                 Value
Vector length             32 (64 16-bit values)
Reused data set size      -
I/O IRAMs                 4
ALU                       9
BREG                      9
FREG                      few
Dataflow graph width      4
Dataflow graph height     7 + 2 (converters)
Configuration cycles      9 + 64



5.8.3 Enhancing parallelism 

To increase parallelism we have two possibilities, which can be combined: 

- Since the smallest data type used in the inner loop limits the throughput of the synthesized
pipeline, we must try to improve this throughput. This is shown in the next subsection.

- The loop nest is a candidate for unroll-and-jam when interprocedural value range analysis finds out
that block_count can only have the values 6, 8 or 12.

Loop Distribution, Partial Unrolling, Reduction Recognition, Loop Fusion 

The conversion of the 8-bit values due to the unsigned character array containing the quantization
matrix limits the throughput of the pipeline. In the best case a value can be read from or written to
the IRAM only every fourth cycle. Therefore we must try to increase the throughput by splitting the
32-bit value into 8-bit values and processing them concurrently in different pipelines. Unfortunately
the loop-carried true dependence due to the accesses to sum prevents a simple partial unrolling which
would achieve this. Loop distribution overcomes this problem.

Loop Distribution 

Since there is no dependence from a read of sum to a write of block_data in the code, it is possible to
distribute the innermost loop into two loops. The first loop also absorbs the guarded loop-invariant
code which represents the first iteration.

for (j=0; j<block_count; j++) {
  for (i=0; i<64; i++) {
    block_data = blocks[k*block_count+j][i];
    mat_data = intra_q[i];
    sol_0 = block_data << (3-dc_prec);
    sol_1_63 = (int)(block_data * mat_data * mquant)>>4;
    sat_1_63 = min(2047, max(-2048, sol_1_63));
    guard1 = (i==0);
    if (guard1)
      sol = sol_0;
    else
      sol = sat_1_63;
    blocks[k*block_count+j][i] = sol;
  }
  for (i=0; i<64; i++) {
    block_data = blocks[k*block_count+j][i];
    guard1 = (i==0);
    if (guard1)
      sum = block_data;
    else
      sum += block_data;
  }
  /* mismatch control */
  block_data = blocks[k*block_count+j][63];
  if ((sum&1)==0) {
    block_data ^= 1;
  }
  blocks[k*block_count+j][63] = block_data;
}

The first loop can be partially unrolled, while the second one is a classical example of a reduction.

Loop 1 - Partial Unrolling

[...] (including the 32-to-8-bit conversion). Therefore the unrolling factor would be limited to 6. The
next smaller divisor of the loop count is 4. Assuming this factor were taken, another restriction
applies: four block_data values would be read and written in one iteration. Although this could be
synthesized by means of shift register synthesis or data [...], the writes would cause either an
undefined result at write-back, if written to two distinct IRAMs, or the merging of the values would
halve the throughput. Therefore the unrolling factor chosen is 2, reaching the maximum throughput with
minimum utilization.

Dead code elimination removes the guard statement for the parts representing the odd iteration
values.

for (i=0; i<64; i+=2) { // unrolled once
  // iteration i == 0,2,4,...
  block_data_0 = blocks[k*block_count+j][i];
  mat_data_0 = intra_q[i];
  sol_0_0 = block_data_0 << (3-dc_prec);
  sol_1_63_0 = (int)(block_data_0 * mat_data_0 * mquant)>>4;
  sat_1_63_0 = min(2047, max(-2048, sol_1_63_0));
  guard1_0 = (i==0);
  if (guard1_0)
    sol_0 = sol_0_0;
  else
    sol_0 = sat_1_63_0;
  blocks[k*block_count+j][i] = sol_0;

  // iteration i == 1,3,5,...
  block_data_1 = blocks[k*block_count+j][i+1];
  mat_data_1 = intra_q[i+1];
  sol_0_1 = block_data_1 << (3-dc_prec);
  sol_1_63_1 = (int)(block_data_1 * mat_data_1 * mquant)>>4;
  sat_1_63_1 = min(2047, max(-2048, sol_1_63_1));
  blocks[k*block_count+j][i+1] = sat_1_63_1;
}

Loop 2 - Sum Reduction

As above, the block data access limits the reduction possibilities; therefore the code transforms to

for (i=0; i<64; i+=2) {
  block_data_0 = blocks[k*block_count+j][i];
  block_data_1 = blocks[k*block_count+j][i+1];
  guard1 = (i==0);
  if (guard1)
    sum = block_data_0 + block_data_1;
  else
    sum += block_data_0 + block_data_1;
}
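The pairwise form of the sum can be compared against the serial accumulation with a short C sketch. The names sum_serial, sum_pairwise and sums_agree are hypothetical, and the test data is arbitrary.

```c
#define NCOEF 64

/* Serial accumulation: one addition per element on the critical path. */
int sum_serial(const short *block)
{
    int i, sum = 0;
    for (i = 0; i < NCOEF; i++)
        sum += block[i];
    return sum;
}

/* Pairwise accumulation: the two elements of one iteration are added
 * first, so only one addition per pair depends on the previous sum. */
int sum_pairwise(const short *block)
{
    int i, sum = 0;
    for (i = 0; i < NCOEF; i += 2) {
        int pair = block[i] + block[i + 1];   /* independent of sum */
        if (i == 0)
            sum = pair;
        else
            sum += pair;
    }
    return sum;
}

/* Returns 1 if both forms agree on a small test block. */
int sums_agree(void)
{
    short block[NCOEF];
    int i;
    for (i = 0; i < NCOEF; i++)
        block[i] = (short)(i * i - 700);
    return sum_serial(block) == sum_pairwise(block);
}
```

Integer addition is associative, so the regrouping cannot change the sum or its parity, which is all the mismatch control needs.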

Loop Fusion 

The new loops can then be merged again, because still no dependence exists between them.
Furthermore, the loop-invariant code following the loops is moved inside the loop body, producing a
perfect loop nest.

for (j=0; j<block_count; j++) {
  for (i=0; i<64; i+=2) { // unrolled once
    block_data_0 = blocks[k*block_count+j][i];
    block_data_1 = blocks[k*block_count+j][i+1];
    mat_data_0 = intra_q[i];
    mat_data_1 = intra_q[i+1];

    // i == 0,2,4,...
    sol_0_0 = block_data_0 << (3-dc_prec);
    sol_1_63_0 = (int)(block_data_0 * mat_data_0 * mquant)>>4;
    sat_1_63_0 = min(2047, max(-2048, sol_1_63_0));
    guard0 = (i==0);
    if (guard0)
      sol_0 = sol_0_0;
    else
      sol_0 = sat_1_63_0;

    // i == 1,3,5,...
    sol_0_1 = block_data_1 << (3-dc_prec);
    sol_1_63_1 = (int)(block_data_1 * mat_data_1 * mquant)>>4;
    sol_1 = min(2047, max(-2048, sol_1_63_1));

    if (guard0)
      sum = sol_0 + sol_1;
    else
      sum += sol_0 + sol_1;
    guard2 = (i==62);
    guard3 = ((sum & 1) == 0);
    if (guard2 && guard3)
      sol_1 ^= 1;
    blocks[k*block_count+j][i] = sol_0;
    blocks[k*block_count+j][i+1] = sol_1;
  }
}

As can be seen in the next table, these transformations have almost doubled the utilization and
performance.

Parameter                 Value
Vector length             32 (64 16-bit values)
Reused data set size      -
I/O IRAMs                 4
ALU                       18
BREG                      11
FREG                      4
Dataflow graph width      8
Dataflow graph height     9 + 4 (converters)
Configuration cycles      13 + 32



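Unroll-and-jam

As said above, the loop nest is a candidate for unroll-and-jam when interprocedural value range
analysis finds out that block_count can only have the values 6, 8 or 12, i.e. that it is always
divisible by 2. Thus unroll-and-jam with an unrolling factor of 2 is applicable.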

for (j=0; j<block_count; j+=2) { // unrolled and jammed once
  for (i=0; i<64; i+=2) { // unrolled once
    // common code
    mat_data_0 = intra_q[i];
    mat_data_1 = intra_q[i+1];
    guard1 = (i==0);
    guard2 = (i == 62);

    // j == 0,2,...
    block0_data_0 = blocks[k*block_count+j][i];
    block0_data_1 = blocks[k*block_count+j][i+1];

    // i == 0,2,4,...
    sol0_0_0 = block0_data_0 << (3-dc_prec);
    sol0_1_63_0 = (int)(block0_data_0 * mat_data_0 * mquant) >> 4;
    sat0_1_63_0 = min(2047, max(-2048, sol0_1_63_0));
    if (guard1)
      sol0_0 = sol0_0_0;
    else
      sol0_0 = sat0_1_63_0;

    // i == 1,3,5,...
    sol0_1_63_1 = (int)(block0_data_1 * mat_data_1 * mquant) >> 4;
    sol0_1 = min(2047, max(-2048, sol0_1_63_1));
    if (guard1)
      sum0 = sol0_0 + sol0_1;
    else
      sum0 += sol0_0 + sol0_1;
    guard3 = ((sum0 & 1) == 0);
    if (guard2 && guard3)
      sol0_1 ^= 1;
    blocks[k*block_count+j][i] = sol0_0;
    blocks[k*block_count+j][i+1] = sol0_1;

    // j == 1,3,...
    block1_data_0 = blocks[k*block_count+j+1][i];
    block1_data_1 = blocks[k*block_count+j+1][i+1];

    // i == 0,2,4,...
    sol1_0_0 = block1_data_0 << (3-dc_prec);
    sol1_1_63_0 = (int)(block1_data_0 * mat_data_0 * mquant) >> 4;
    sat1_1_63_0 = min(2047, max(-2048, sol1_1_63_0));
    if (guard1)
      sol1_0 = sol1_0_0;
    else
      sol1_0 = sat1_1_63_0;

    // i == 1,3,5,...
    sol1_1_63_1 = (int)(block1_data_1 * mat_data_1 * mquant) >> 4;
    sol1_1 = min(2047, max(-2048, sol1_1_63_1));
    if (guard1)
      sum1 = sol1_0 + sol1_1;
    else
      sum1 += sol1_0 + sol1_1;
    guard4 = ((sum1 & 1) == 0);
    if (guard2 && guard4)
      sol1_1 ^= 1;
    blocks[k*block_count+j+1][i] = sol1_0;
    blocks[k*block_count+j+1][i+1] = sol1_1;
  }
}
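
The effect of unroll-and-jam can be illustrated independently of the quantization kernel. The following is a minimal sketch, not the compiler's output: the array sizes, the function names and the simple scaling operation are assumptions chosen only to show that unrolling the outer loop by two and fusing the inner loop copies leaves the result unchanged while the shared operand q[i] is fetched once for two rows.

```c
#include <assert.h>

#define N_ROWS 6    /* hypothetical stand-in for block_count, must be even */
#define N_COLS 64   /* elements per block */

/* Reference version: plain two-level loop nest. */
void scale_plain(int a[N_ROWS][N_COLS], const int q[N_COLS])
{
    for (int j = 0; j < N_ROWS; j++)
        for (int i = 0; i < N_COLS; i++)
            a[j][i] = (a[j][i] * q[i]) >> 4;
}

/* Unroll-and-jam: the outer loop is unrolled by 2 and the two copies of
 * the inner loop are fused ("jammed") into a single inner loop.
 * The common operand q[i] is read once per pair of rows. */
void scale_unroll_and_jam(int a[N_ROWS][N_COLS], const int q[N_COLS])
{
    assert(N_ROWS % 2 == 0);  /* the precondition the value range analysis must establish */
    for (int j = 0; j < N_ROWS; j += 2)
        for (int i = 0; i < N_COLS; i++) {
            int qi = q[i];                          /* common code */
            a[j][i]     = (a[j][i]     * qi) >> 4;  /* even row */
            a[j + 1][i] = (a[j + 1][i] * qi) >> 4;  /* odd row */
        }
}
```

Both functions can be run on identical non-negative input and compared element by element; the jammed version exposes two independent dataflow graphs per inner iteration, which is what doubles the graph width in the table that follows.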

The results of the version where unroll-and-jam is applied are shown in the following table.

Parameter                 Value
Vector length             2*32 (2*64 16-bit values)
Reused data set size
I/O IRAMs                 5
ALU                       36
BREG                      22
FREG                      8
Dataflow graph width      2*8
Dataflow graph height     9+4 (converters)
Configuration cycles      13+32



5.8.4 Final Code 

The RISC code contains only the control code of the outer loops and the preload and execute calls.
Since the data besides the block data does not vary within the j-loop, and the XPP FIFO initially sets
the IRAM values to the previous preload, redundant load/store elimination moves the preloads in front
of the j-loop. The same is done with the configuration preload. The RISC code then looks like:

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA)
    if (mpeg1)
      /* omitted */
    else {
      XppPreloadConfig(_XppCfg_iquant_intra_mpeg2);

      XppPreload(2, &intra_q, 16);
      XppPreload(3, &mbinfo[k].mquant, 1);
      XppPreload(4, &dc_prec, 1);

      for (j=0; j<block_count; j+=2) {
        XppPreload(0, &blocks[k*block_count + j], 32);
        XppPreload(1, &blocks[k*block_count + j+1], 32);
        XppExecute();
      }
      XppSync(&blocks[k*block_count], 64 * block_count);
    }
}

The configuration code reads: 

void _XppCfg_iquant_intra_mpeg2()
{
  // IRAMs

  // blocks[k*block_count+j] and blocks[k*block_count+j+1], respectively.
  // Read with splitter to two 16-bit packets, respectively:
  // iram0,1[i] and iram0,1[i+1] are available concurrently.
  short iram0[256], iram1[256];

  // intra_q
  // Read with splitter to 4 8-bit streams and merged to 2 streams:
  // iram2[i] and iram2[i+1] are available concurrently.
  unsigned char iram2[512];

  int iram3[128], iram4[128]; // scalars mquant and dc_prec

  // temporaries
  int i;
  int sol0_0_0, sol0_0_1, sol0_0, sol0_1;
  int sol1_0_0, sol1_0_1, sol1_0, sol1_1;
  int sol0_1_63_0, sol0_1_63_1, sat0_1_63_0;
  int sol1_1_63_0, sol1_1_63_1, sat1_1_63_0;
  int sum0, sum1;
  event guard1, guard2, guard3, guard4;

  for (i=0; i<64; i+=2) { // unrolled once
    // common code
    guard1 = (i == 0);
    guard2 = (i == 62);

    // j == 0,2,...
    // i == 0,2,4,...
    sol0_0_0 = iram0[i] << (3-iram3[0]);
    sol0_1_63_0 = (int)(iram0[i] * iram2[i] * iram4[0]) >> 4;
    sat0_1_63_0 = min(2047, max(-2048, sol0_1_63_0));
    if (guard1)
      sol0_0 = sol0_0_0;
    else
      sol0_0 = sat0_1_63_0;

    // i == 1,3,5,...
    sol0_1_63_1 = (int)(iram0[i+1] * iram2[i+1] * iram4[0]) >> 4;
    sol0_1 = min(2047, max(-2048, sol0_1_63_1));
    if (guard1)
      sum0 = sol0_0 + sol0_1;
    else
      sum0 += sol0_0 + sol0_1;
    guard3 = ((sum0 & 1) == 0);
    if (guard2 && guard3)
      sol0_1 ^= 1;
    iram0[i] = sol0_0;
    iram0[i+1] = sol0_1;

    // part for odd j values omitted
  }
}



^z 6 Li:::^ Mow ° { °°° *** ° f *° «■ ». **** «*„ «. 

5.8.5 Performance Evaluation 




blocks[k*block_count + j]



blocks[k*block_count + j+1]



us 

128 



128 
Although this is not as simple as the previous optimization, the performance impact of almost 100 cycles
seems to make it a required feature for a compiler.

The simulation yields 110 cycles for the configuration execution, which must be doubled to scale it to
rfthtSloS CyC,eS ' A mUltip,iCati0n by 6 yields 4116 61131 execu ti°n cycles for one iteration 

The results are summarized in the following table. 



configurations 
startup 


RATrt 
224 


"D^acne 
6 


£ptig 
iyou 


(Utalidrt 
ICache 






'^ J C^e)^ii; 


steady state 


o/z 


32 




1117 


2201^632 




.-'>;•!'•, 
i-'.: - .-eui' 

tth VV.** 




sum 


4256 















This table describes the worst case: all data must be loaded from RAM. When we assume that the
configuration is loaded from cache, which is an accurate assumption because it mainly alternates with the
configuration for non-intra-coded blocks, the statistics look much better. Since the quantization matrix
and the scaling constants also stay in the cache, their preloads do not burden the cache-RAM bus as



configurations 
startup 



RAM 



*RAM - 



^^tache|M 




steady state- 
sum 1 



TTT7 



"32 



" ^ 220fe;/ & " : 



i.. ^".tr- 0 4 !v> j 



The final utilization is shown in the following table. The big differences from the estimated values for
the BREGs and FREGs result from the distributed counters.



#3 



for 



Parameter                     Value
Vector length                 2*32 (2*64 16-bit values)
Reused data set size
I/O IRAMs [sum - pct]         5 - 31%
ALU [sum - pct]               39 - 61%
BREG [def/route/sum - pct]    39/14/53 - 66%
FREG [def/route/sum - pct]    20/16/36 - 45%

[Figure: dataflow graph of the configuration, fed from iram0/iram1 and iram3. Legend: convert 32 to 8, even iterations, odd iterations, accumulate, mismatch control.]

Dataflow graph of the MPEG2 inverse quantization for intra coded blocks. The yellow and green
blocks were produced by partial unrolling. The difference is that the green block must take care of the
special iteration value 0. The blue block does the accumulation, and mismatch control is applied where
necessary.
5.9 MPEG2 codec - IDCT

The idct algorithm (inverse discrete cosine transformation) is used for the MPEG2 video
decompression algorithm. It operates on 8x8 blocks of video images in their frequency representation
and transforms them back into their original signal form. The MPEG2 decoder contains a transform
function that calls idct for all blocks of a frequency-transformed picture to restore the original image.

The idct function consists of two for-loops. The first loop calls idctrow, the second idctcol. Function
inlining is able to eliminate the function calls within the entire loop nest so that the numeric code is no
longer interrupted by function calls. Another way to get rid of function calls in the loop nest is loop
embedding, which pushes loops from the caller into the callee.



5.9.1 Original Code (idct.c)



/* two dimensional inverse discrete cosine transform */ 
void idct (block) 
short *block; 
{ 

int i; 

for (i=0; i<8; i++) 
idctrow (block+8*i) ; 

for (i=0; i<8; i++) 
idctcol (block+i) ; 

} 

The first loop changes the values of the block row by row. Afterwards the changed block is further
transformed column by column. All rows have to be finished before any column processing can be
started.

[Illustration: 8 x idctrow, then 8 x idctcol, then result]



Data dependence analysis detects true data dependences between row processing and column
processing. Therefore processing of the columns has to be delayed until all rows are done. The
innermost loop bodies of idctrow and idctcol are nearly identical. They perform numeric calculations
on eight input values, column values in the case of idctcol and row values in the case of idctrow. Eight
output values are calculated and written back (as column/row). idctcol additionally applies clipping
before the values are written back. This is why we concentrate on idctcol:



/* column (vertical) IDCT
 *
 *             7                         pi         1
 * dst[8*k] = sum c[l] * src[8*l] * cos( -- * ( k + - ) * l )
 *            l=0                        8          2
 *
 * where: c[0]    = 1/1024
 *        c[1..7] = (1/1024)*sqrt(2)
 */
static void idctcol(blk)
short *blk;
{
  int x0, x1, x2, x3, x4, x5, x6, x7, x8;

  /* shortcut */
  if (!((x1 = (blk[8*4]<<8)) | (x2 = blk[8*6]) |
        (x3 = blk[8*2]) | (x4 = blk[8*1]) | (x5 = blk[8*7]) |
        (x6 = blk[8*5]) | (x7 = blk[8*3])))
  {
    blk[8*0]=blk[8*1]=blk[8*2]=blk[8*3]=blk[8*4]=blk[8*5]=
      blk[8*6]=blk[8*7]=iclp[(blk[8*0]+32)>>6];
    return;
  }

  x0 = (blk[8*0]<<8) + 8192;

  /* first stage */
  x8 = W7*(x4+x5) + 4;
  x4 = (x8+(W1-W7)*x4)>>3;
  x5 = (x8-(W1+W7)*x5)>>3;
  x8 = W3*(x6+x7) + 4;
  x6 = (x8-(W3-W5)*x6)>>3;
  x7 = (x8-(W3+W5)*x7)>>3;

  /* second stage */
  x8 = x0 + x1;
  x0 -= x1;
  x1 = W6*(x3+x2) + 4;
  x2 = (x1-(W2+W6)*x2)>>3;
  x3 = (x1+(W2-W6)*x3)>>3;
  x1 = x4 + x6;
  x4 -= x6;
  x6 = x5 + x7;
  x5 -= x7;

  /* third stage */
  x7 = x8 + x3;
  x8 -= x3;
  x3 = x0 + x2;
  x0 -= x2;
  x2 = (181*(x4+x5)+128)>>8;
  x4 = (181*(x4-x5)+128)>>8;

  /* fourth stage */
  blk[8*0] = iclp[(x7+x1)>>14];
  blk[8*1] = iclp[(x3+x2)>>14];
  blk[8*2] = iclp[(x0+x4)>>14];
  blk[8*3] = iclp[(x8+x6)>>14];
  blk[8*4] = iclp[(x8-x6)>>14];
  blk[8*5] = iclp[(x0-x4)>>14];
  blk[8*6] = iclp[(x3-x2)>>14];
  blk[8*7] = iclp[(x7-x1)>>14];
}
W1 - W7 are macros for numeric constants that are substituted by the preprocessor.



void init_idct()
{
  int i;

  iclp = iclip+512;
  for (i= -512; i<512; i++)
    iclp[i] = (i<-256) ? -256 : ((i>255) ? 255 : i);
}



A special kind of idiom recognition, function recognition, is able to replace the
calculation of each array element by a compiler-known function that can be
realized efficiently on the XPP. If the compiler features whole-program memory
aliasing analysis, it is able to replace all uses of the iclp array with a call of the
compiler-known function. Alternatively, a developer can manually replace the iclp array
accesses by calls of the compiler-known saturation function. The
illustration shows a possible implementation of saturate(val,n) in NML; array accesses
like iclp[i] are then replaced by saturate(i,256).



[Illustration: saturate(val,n), built from two SORT units with inputs A, B and outputs X, Y]
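
The equivalence between the lookup table and a saturation function can be checked on the host. The following is a minimal sketch in plain C, assuming that saturate(val, n) clamps val to the range [-n, n-1]; this range is inferred from the table initialization in init_idct and is not taken from the NML macro itself. The helper name init_clip_table is hypothetical.

```c
#include <assert.h>

/* Clamp val into [-n, n-1] in two stages, a minimum followed by a maximum.
 * The exact semantics of the XPP macro are an assumption here. */
int saturate(int val, int n)
{
    assert(n > 0);
    int upper = n - 1;
    int lower = -n;
    int t = (val < upper) ? val : upper;  /* first stage: min(val, n-1) */
    return (t > lower) ? t : lower;       /* second stage: max(t, -n) */
}

/* Table-based clipping as set up by init_idct in the original code. */
static short iclip[1024];
static short *iclp;

void init_clip_table(void)
{
    iclp = iclip + 512;                   /* allows indices -512..511 */
    for (int i = -512; i < 512; i++)
        iclp[i] = (short)((i < -256) ? -256 : ((i > 255) ? 255 : i));
}
```

For every index in the table range, iclp[i] and saturate(i, 256) give the same value, so replacing the array access removes a memory lookup without changing the result.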



The /* shortcut */ code in idctcol speeds up column processing if x1 to x7 are equal to
zero. The shortcut code in idctrow and idctcol has to be removed. The code
snippet below shows the inlined version of the idctrow loop with additional cache instructions:

void idct(block)
short *block;
{
  int i;

  XppPreloadConfig(_XppCfg_idctrow); // Loop Invariant

  for (i=0; i<8; i++) {
    short *blk;
    int x0, x1, x2, x3, x4, x5, x6, x7, x8;

    blk = block+8*i;
    XppPreload(0, blk, 8/2);      // 8 shorts = 4 ints
    XppPreloadClean(1, blk, 8/2); // IRAM1 is erased and assigned to blk
    XppExecute();
  }

  for (i=0; i<8; i++) {
5.9.2 Enhancing XPP utilization



As mentioned at the beginning, idct is called for all data blocks of a video image (loop in transform.c).
This circumstance allows us to further improve the XPP utilization.

When we look at the dataflow graph of idctcol in detail, we see that it forms a very deep pipeline.
_XppCfg_idctrow runs only eight times on the XPP, which means that only 64 elements (8 times 8
elements of a column) are processed through this pipeline. Furthermore, all data must have left the
pipeline before the XPP configuration can change to the _XppCfg_idctcol configuration to go on with
column processing. This means that something is still suboptimal in the example.

Pipeline Depth 

The pipeline is just too deep for processing only eight times eight rows. Filling
and flushing a deep pipeline is expensive if only little data is processed with it.
First the units at the end of the pipeline are idle, and then the units at the beginning
are unused.



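[Illustration: pipeline fill and flush. While the pipeline fills, its rear part is IDLE; while it drains, its front part is IDLE. Horizontal extent: pipeline depth.]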
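
The cost of filling and flushing can be put into numbers with a simple model. The following sketch assumes an ideal pipeline that accepts one data packet per cycle and never stalls; the function names and the depth of 20 stages used in the note below are illustrative assumptions, not measured values.

```c
#include <assert.h>

/* Cycles needed to push n_items through a stall-free pipeline of the
 * given depth, one item per cycle: the last item enters after n_items
 * cycles and needs depth - 1 further cycles to leave. */
int pipeline_cycles(int n_items, int depth)
{
    assert(n_items > 0 && depth > 0);
    return n_items + depth - 1;
}

/* Share of cycles (in percent, rounded down) in which a pipeline stage
 * does useful work. */
int pipeline_utilization_pct(int n_items, int depth)
{
    return (100 * n_items) / pipeline_cycles(n_items, depth);
}
```

Under this model, 8 items in a 20-stage pipeline keep each stage busy for only 29% of the time, whereas 32 items raise this to 62%. This is the motivation for processing several blocks per configuration.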



Loop Interchange and Loop Tiling 

It is profitable to use loop interchange for moving the dependences between row and column
processing to an outer level of the loop nest. The loop that calls the idct function in transform.c on
several blocks of the image has no dependence preventing loop interchange. Therefore this loop can be
moved inside the loops of column and row processing.

// transform.c
for (n=0; n<block_count; n++) {
  idct(blocks[k*block_count+n]); // block_count is 6 or 8 or 12
}

// idct.c
void idct(block)
short *block;
{
  int i;

  for (i=0; i<8; i++)
    idctrow(block+8*i);

  for (i=0; i<8; i++)
    idctcol(block+i);
}
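
The legality of this interchange can be demonstrated with a small experiment. The sketch below is only an illustration: row_op and col_op are hypothetical stand-ins for idctrow and idctcol (running sums instead of the real butterfly), and BLOCKS is an assumed block count. What matters is the dependence structure: within one block all rows must precede the columns, while different blocks are independent.

```c
#include <assert.h>

#define BLOCKS 4  /* assumed number of blocks processed together */

/* Stand-in for idctrow: running sum along one row of 8 elements. */
static void row_op(short *row)
{
    for (int c = 1; c < 8; c++)
        row[c] = (short)(row[c] + row[c - 1]);
}

/* Stand-in for idctcol: running sum along one column (stride 8). */
static void col_op(short *col)
{
    for (int r = 1; r < 8; r++)
        col[8 * r] = (short)(col[8 * r] + col[8 * (r - 1)]);
}

/* Original order: block loop outermost, rows then columns per block. */
void transform_per_block(short blocks[BLOCKS][64])
{
    for (int n = 0; n < BLOCKS; n++) {
        for (int i = 0; i < 8; i++) row_op(blocks[n] + 8 * i);
        for (int i = 0; i < 8; i++) col_op(blocks[n] + i);
    }
}

/* After interchange: the block loop is innermost, so each phase
 * streams BLOCKS * 8 rows (or columns) through one configuration. */
void transform_interchanged(short blocks[BLOCKS][64])
{
    for (int i = 0; i < 8; i++)
        for (int n = 0; n < BLOCKS; n++) row_op(blocks[n] + 8 * i);
    for (int i = 0; i < 8; i++)
        for (int n = 0; n < BLOCKS; n++) col_op(blocks[n] + i);
}
```

Both functions produce identical blocks, because the row/column dependence stays inside each block and is still respected by the phase boundary between the two loop nests.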



Constraints (Cache Sensitive Loop Tiling) 

The number of blocks that are processed together cannot be chosen freely. The data produced
during _XppCfg_idctrow must fit into the cache so that the subsequent configuration can read them from
there. Loop tiling therefore has to be applied with respect to the cache
size, so that the processed data fit into the cache.

5.9.3 NML Code Generation 

Dataflow Graph 
Address generation, data duplication and data layout transformation:

The structure of the loop body leads to the problem of address generation for the input data.

For idctrow and idctcol we have to access one row/column per cycle to get a fully utilized pipeline. As
the rows/columns are packed, i.e. one row/column is packed into four words, we use IRAM data
duplication (as described in the hardware section) to enable the 4-times access which is needed to
fetch a full row/column (eight short values) per cycle.

We use one counter per IRAM to realize address generation. The four counters are started with
different offsets, as they correspond to different elements of the fetched row/column (the elements of a
row/column are packed columns/rows). Therefore we implemented a counter macro that has a
configurable start, stop and increment value, and fits into the same PAE. Detailed
descriptions of the used macros are given in the appendix.

The fetched row/column has to be unpacked with split macros. A split macro splits packets of two
shorts in an input stream into two separate streams. Now eight input values are processed by the
dataflow graph and eight result values (shorts) are created.

Address generation for writing back the results is not needed, as we connect the eight result streams to
eight IRAMs in FIFO mode, which are mapped to one continuous address range. Before the results are
written into the FIFOs, packing is applied to provide packed input data for the next configuration.

As a result, the combination of reading data from duplicated IRAMs in RAM mode and writing the
results into FIFOs causes changes in the data layout of the input array. We have to ensure that after all
data processing the original data layout is recovered. For this reason we need an extra configuration
which restores the original data layout of the input array. This is done in _XppCfg_idctreorder, which
also performs the saturation of idctcol to make the configuration for idctcol smaller.

Figure 2 illustrates the data layout changes during the whole process. After applying the last
configuration the data layout is the same as before.

Figure 2 Data layout transformations in idct configurations






WO 2005/010632 



PCT/EP2004/006547 



5.9.4 Architectural parameters 



The following section shows the architectural parameters used by the compiler driver. These values are
based on heuristics and may not exactly match the final results. They are just start values for the
optimization process.

_XppCfg_idctrow

Parameter                 Value
Vector length             4 words
Reused data set size      4x8x4 words
I/O IRAMs                 4 (data duplication) + 8 (output)
ALU                       31 (dfg) + 8 (pack)
BREG                      32 (dfg) + 8 (pack) + 8 (unpack) + 4 (addr. sel.)
FREG                      0 (dfg) + 8 (pack) + 4 (unpack) + 4 (addr. sel.)
Dataflow graph width      8
Dataflow graph height     10
Configuration cycles      128/4 + 10x2



_XppCfg_idctcol

Parameter                 Value
Vector length             4 words
Reused data set size      4x8x4 words
I/O IRAMs                 4 (data duplication) + 8 (output)
ALU                       37 (dfg) + 8 (pack)
BREG                      35 (dfg) + 8 (pack) + 8 (unpack) + 4 (addr. sel.)
FREG                      0 (dfg) + 8 (pack) + 4 (unpack) + 4 (addr. sel.)
Dataflow graph width      8
Dataflow graph height     10
Configuration cycles      128/4 + 10x2
_XppCfg_idctreorder

Parameter                 Value
Vector length             4 words
Reused data set size      4x8x4 words
I/O IRAMs                 4 (data duplication) + 8 (output)
ALU                       16 (dfg) + 8 (pack)
BREG                      8 (dfg) + 8 (pack) + 8 (unpack) + 4 (addr. sel.)
FREG                      0 (dfg) + 8 (pack) + 4 (unpack) + 4 (addr. sel.)
Dataflow graph width      8
Dataflow graph height     2
Configuration cycles      128/4 + 2x2



Total estimated optimal configuration cycles (considering no routing delays and pipeline stalls) for
processing 4 blocks:

2 x (128/4 + 10x2) + 128/4 + 2x2 = 140 cycles
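
The estimate can be reproduced with a few lines of C. The sketch below only re-evaluates the formula from the parameter tables; the function names are hypothetical, and the reading of the two terms (words streamed per vector length, plus twice the dataflow graph height) follows the tables literally without claiming how the compiler driver derives them.

```c
#include <assert.h>

/* Estimated cycles of one configuration: the data set is streamed in
 * vectors of vector_len words, plus twice the dataflow graph height. */
int config_cycles(int words, int vector_len, int height)
{
    assert(vector_len > 0);
    return words / vector_len + height * 2;
}

/* Sum over the three idct configurations for 4 blocks (128 words). */
int idct_total_cycles(void)
{
    return 2 * config_cycles(128, 4, 10)  /* idctrow and idctcol */
         + config_cycles(128, 4, 2);      /* idctreorder */
}
```

Each of the two transform configurations contributes 52 cycles and the reorder configuration 36 cycles, which gives the total of 140 cycles stated above.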



5.9.5 Example source code after transformations: 

The following sources result from applying the optimizations discussed above. As the IRAM size is 
imally fixed to 128 words we can only process 4 blocks at once. The original source code has to be 
adapted to make this block size possible. 

transform.c

Finally, the idct function is completely inlined into the itransform function of transform.c. If
block_count is equal to 4, and we assume that 32*4 words do not exceed the cache size, then we can
transform the example into:

/* inverse transform prediction error and add prediction */
void itransform(pred,cur,mbi,blocks)
unsigned char *pred[], *cur[];
struct mbinfo *mbi;
short blocks[][64];
{
  int i, j, i1, j1, k, n, cc, offs, lx;
  short *block, *nextblock;

  k = 0;

  for (j=0; j<height2; j+=16)
    for (i=0; i<width; i+=16)
    {
      if (block_count == 4) { // xpp execution only if block_count is 4
        XppPreloadConfig(_XppCfg_idctrow);

        // hide cache miss with preloading next 4 blocks (if not last iteration)
        nextblock = blocks[(k+1) * 4];
        if (i+16 >= width) XppPreload(1, nextblock, 128);

        // do processing of actual 4 blocks
        block = blocks[k * 4];
        // Input Data
        // IRAMs 0,2,4,6 = 0x55 = 0b01010101
        XppPreloadMultiple(0x55, block, 128);
        // Output Data
        XppPreloadClean( 8, &block[0*16], 16);
        XppPreloadClean( 9, &block[1*16], 16);
        XppPreloadClean(10, &block[2*16], 16);
        XppPreloadClean(11, &block[3*16], 16);
        XppPreloadClean(12, &block[4*16], 16);
        XppPreloadClean(13, &block[5*16], 16);
        XppPreloadClean(14, &block[6*16], 16);
        XppPreloadClean(15, &block[7*16], 16);
        XppExecute();

        // this one causes a read miss
        XppPreloadConfig(_XppCfg_idctcol);
        // Input Data
        // IRAMs 0,2,4,6 = 0x55 = 0b01010101
        XppPreloadMultiple(0x55, block, 128);
        // Output Data
        XppPreloadClean( 8, &block[0*16], 16);
        XppPreloadClean( 9, &block[1*16], 16);
        XppPreloadClean(10, &block[2*16], 16);
        XppPreloadClean(11, &block[3*16], 16);
        XppPreloadClean(12, &block[4*16], 16);
        XppPreloadClean(13, &block[5*16], 16);
        XppPreloadClean(14, &block[6*16], 16);
        XppPreloadClean(15, &block[7*16], 16);
        XppExecute();

        XppPreloadConfig(_XppCfg_idctreorder);
        // Input Data
        // IRAMs 0,2,4,6 = 0x55 = 0b01010101
        XppPreloadMultiple(0x55, block, 128);
        // Output Data
        XppPreloadClean( 8, &block[0*16], 16);
        XppPreloadClean( 9, &block[1*16], 16);
        XppPreloadClean(10, &block[2*16], 16);
        XppPreloadClean(11, &block[3*16], 16);
        XppPreloadClean(12, &block[4*16], 16);
        XppPreloadClean(13, &block[5*16], 16);
        XppPreloadClean(14, &block[6*16], 16);
        XppPreloadClean(15, &block[7*16], 16);
        XppExecute();
      }

      for (n=0; n<block_count; n++) {
        cc = (n<4) ? 0 : (n&1)+1; /* color component index */
        if (cc==0) {
          /* luminance */
          if ((pict_struct==FRAME_PICTURE) && mbi[k].dct_type) {
            /* field DCT */
            offs = i + ((n&1)<<3) + width*(j+((n&2)>>1));
            lx = width<<1;
          }
          else {
            /* frame DCT */
            offs = i + ((n&1)<<3) + width2*(j+((n&2)<<2));
            lx = width2;
          }
          if (pict_struct==BOTTOM_FIELD) offs += width;
        }
        else {
          /* chrominance */
          /* scale coordinates */
          i1 = (chroma_format==CHROMA444) ? i : i>>1;
          j1 = (chroma_format!=CHROMA420) ? j : j>>1;

          if ((pict_struct==FRAME_PICTURE) && mbi[k].dct_type
              && (chroma_format!=CHROMA420)) {
            /* field DCT */
            offs = i1 + (n&8) + chrom_width*(j1+((n&2)>>1));
            lx = chrom_width<<1;
          }
          else {
            /* frame DCT */
            offs = i1 + (n&8) + chrom_width2*(j1+((n&2)<<2));
            lx = chrom_width2;
          }
          if (pict_struct==BOTTOM_FIELD) offs += chrom_width;
        }

        // fallback to RISC execution if block_count != 4
        if (block_count != 4) idct(blocks[k*block_count+n]);
        else XppSync(blocks[k*block_count+n], 64/2); // ensure WB done for bl
        add_pred(pred[cc]+offs, cur[cc]+offs, lx, blocks[k*block_count+n]);
      }

      k++;
    }
}

_XppCfg_idctrow

#define W1 2841 /* 2048*sqrt(2)*cos(1*pi/16) */
#define W2 2676 /* 2048*sqrt(2)*cos(2*pi/16) */
#define W3 2408 /* 2048*sqrt(2)*cos(3*pi/16) */
#define W5 1609 /* 2048*sqrt(2)*cos(5*pi/16) */
#define W6 1108 /* 2048*sqrt(2)*cos(6*pi/16) */
#define W7 565  /* 2048*sqrt(2)*cos(7*pi/16) */

/** _XppCfg_idctrow()
 * Does idct row calculation for 4 blocks
 * XPPIN:  iram0,2,4,6 contains 4 blocks (data duplication)
 * XPPOUT: iram8-15 contains transposed calc. results
 */
void _XppCfg_idctrow() {
  // Input IRAMs in RAM Mode
  int iram0[128], iram2[128], iram4[128], iram6[128];
  // Output IRAMs in FIFO Mode
  int *iram8, *iram9, *iram10, *iram11, *iram12, *iram13, *iram14, *iram15;

  int r0, r1, r2, r3, r4, r5, r6, r7, r8;
  int r01, r23, r45, r67;

  // Counter offsets for parallel access
  int i0=0, i1=1, i2=2, i3=3;
  int k;

  for(k=0; k<32; k++) {
    // Data layout of input array is:
    //   row0blk0, ..., row7blk0, row0blk1, ..., row7blk3
    //   (with 4 packed columns ([0,1], [2,3], [4,5], [6,7]))
    //   0 3, ..., 28 31, 32 35, ..., 124 127
    r01 = iram0[i0+=4]; // row element 0 and 1
    r23 = iram2[i1+=4]; // row element 2 and 3
    r45 = iram4[i2+=4]; // row element 4 and 5
    r67 = iram6[i3+=4]; // row element 6 and 7

    // Packed row elements have to be separated with _split16
    _split16(r01, r4, r0);
    _split16(r23, r7, r3);
    _split16(r45, r6, r1);
    _split16(r67, r5, r2);

    r1 = r1<<11;
    r0 = (r0<<11) + 128; /* for proper rounding in the fourth stage */

    /* first stage */
    r8 = W7*(r4+r5);
    r4 = r8 + (W1-W7)*r4;
    r5 = r8 - (W1+W7)*r5;
    r8 = W3*(r6+r7);
    r6 = r8 - (W3-W5)*r6;
    r7 = r8 - (W3+W5)*r7;

    /* second stage */
    r8 = r0 + r1;
    r0 -= r1;
    r1 = W6*(r3+r2);
    r2 = r1 - (W2+W6)*r2;
    r3 = r1 + (W2-W6)*r3;
    r1 = r4 + r6;
    r4 -= r6;
    r6 = r5 + r7;
    r5 -= r7;

    /* third stage */
    r7 = r8 + r3;
    r8 -= r3;
    r3 = r0 + r2;
    r0 -= r2;
    r2 = (181*(r4+r5)+128)>>8;
    r4 = (181*(r4-r5)+128)>>8;

    /* fourth stage */
    // _write16 does vertical packing on row element streams (columns)
    // to have horizontal packing on columns for the next configuration
    _write16(iram8,  k, (r7+r1)>>8);
    _write16(iram9,  k, (r3+r2)>>8);
    _write16(iram10, k, (r0+r4)>>8);
    _write16(iram11, k, (r8+r6)>>8);
    _write16(iram12, k, (r8-r6)>>8);
    _write16(iram13, k, (r0-r4)>>8);
    _write16(iram14, k, (r3-r2)>>8);
    _write16(iram15, k, (r7-r1)>>8);
  }
}



_XppCfg_idctcol

#define W1 2841 /* 2048*sqrt(2)*cos(1*pi/16) */
#define W2 2676 /* 2048*sqrt(2)*cos(2*pi/16) */
#define W3 2408 /* 2048*sqrt(2)*cos(3*pi/16) */
#define W5 1609 /* 2048*sqrt(2)*cos(5*pi/16) */
#define W6 1108 /* 2048*sqrt(2)*cos(6*pi/16) */
#define W7 565  /* 2048*sqrt(2)*cos(7*pi/16) */

/** _XppCfg_idctcol()
 * Does idct column calculation for 4 blocks
 * XPPIN:  iram0,2,4,6 contains 4 blocks (data duplication)
 * XPPOUT: iram8-15 contains transposed calc. results
 */
void _XppCfg_idctcol() {
  // Input IRAMs in RAM Mode
  int iram0[128], iram2[128], iram4[128], iram6[128];
  // Output IRAMs in FIFO Mode
  int *iram8, *iram9, *iram10, *iram11, *iram12, *iram13, *iram14, *iram15;

  int c0, c1, c2, c3, c4, c5, c6, c7, c8;
  int c01, c23, c45, c67;

  // Counter offsets for parallel access
  int i0=0, i1=1, i2=2, i3=3;
  int k;

  for(k=0; k<32; k++) {
    // Data layout of input array is:
    //   col0blk0, ..., col0blk3, col1blk0, ..., col7blk3
    //   (with 4 packed rows ([0,1], [2,3], [4,5], [6,7]))
    //   0 3, ..., 12 15, 16 19, ..., 124 127
    c01 = iram0[i0+=4]; // column element 0 and 1
    c23 = iram2[i1+=4]; // column element 2 and 3
    c45 = iram4[i2+=4]; // column element 4 and 5
    c67 = iram6[i3+=4]; // column element 6 and 7

    // Packed column elements have to be separated with _split16
    _split16(c01, c4, c0);
    _split16(c23, c7, c3);
    _split16(c45, c6, c1);
    _split16(c67, c5, c2);

    c1 = c1<<8;
    c0 = (c0<<8) + 8192;

    /* first stage */
    c8 = W7*(c4+c5) + 4;
    c4 = (c8+(W1-W7)*c4)>>3;
    c5 = (c8-(W1+W7)*c5)>>3;
    c8 = W3*(c6+c7) + 4;
    c6 = (c8-(W3-W5)*c6)>>3;
    c7 = (c8-(W3+W5)*c7)>>3;

    /* second stage */
    c8 = c0 + c1;
    c0 -= c1;
    c1 = W6*(c3+c2) + 4;
    c2 = (c1-(W2+W6)*c2)>>3;
    c3 = (c1+(W2-W6)*c3)>>3;
    c1 = c4 + c6;
    c4 -= c6;
    c6 = c5 + c7;
    c5 -= c7;

    /* third stage */
    c7 = c8 + c3;
    c8 -= c3;
    c3 = c0 + c2;
    c0 -= c2;
    c2 = (181*(c4+c5)+128)>>8;
    c4 = (181*(c4-c5)+128)>>8;

    /* fourth stage */
    // _write16 does vertical packing on column element streams (blocks)
    // to have horizontal packing on blocks for the next configuration
    _write16(iram8,  k, (c7+c1)>>14);
    _write16(iram9,  k, (c3+c2)>>14);
    _write16(iram10, k, (c0+c4)>>14);
    _write16(iram11, k, (c8+c6)>>14);
    _write16(iram12, k, (c8-c6)>>14);
    _write16(iram13, k, (c0-c4)>>14);
    _write16(iram14, k, (c3-c2)>>14);
    _write16(iram15, k, (c7-c1)>>14);
  }
}



_XppCfg_idctreorder

#define min(A,B) (((A)<=(B))?(A):(B))
#define max(A,B) (((A)>=(B))?(A):(B))

/** _XppCfg_idctreorder()
 * Saturates and restores original data layout
 * XPPIN:  iram0,2,4,6 contains 4 blocks (data duplication)
 * XPPOUT: iram8-15 contains transposed calc. results
 */
void _XppCfg_idctreorder() {
  // Input IRAMs in RAM Mode
  int iram0[128], iram2[128], iram4[128], iram6[128];
  // Output IRAMs in FIFO Mode
  int *iram8, *iram9, *iram10, *iram11, *iram12, *iram13, *iram14, *iram15;

  int b0l, b0h, b1l, b1h, b2l, b2h, b3l, b3h;
  int b01l, b01h, b23l, b23h;

  // Counter offsets for parallel access
  int i0=0, i1=0+64, i2=1, i3=1+64;
  int k;

  for(k=0; k<32; k++) {
    // Data layout of input array is:
    //   row0col0, ..., row0col7, row1col0, ..., row7col7
    //   (with 2 packed blocks (0, 1, 2, 3))
    //   0 1, ..., 14 15, 16 17, ..., 124 127
    b01l = iram0[i0+=2]; // fetch lower half of block 0 and 1
    b01h = iram2[i1+=2]; // fetch upper half of block 0 and 1
    b23l = iram4[i2+=2]; // fetch lower half of block 2 and 3
    b23h = iram6[i3+=2]; // fetch upper half of block 2 and 3

    // Packed blocks have to be separated with _split16
    _split16(b01l, b1l, b0l);
    _split16(b01h, b1h, b0h);
    _split16(b23l, b3l, b2l);
    _split16(b23h, b3h, b2h);

    // _write16 does vertical packing on block streams to have
    // horizontal packing on rows as in the original data layout
    _write16(iram8,  k, min(max(b0l, -256), 255));
    _write16(iram9,  k, min(max(b0h, -256), 255));
    _write16(iram10, k, min(max(b1l, -256), 255));
    _write16(iram11, k, min(max(b1h, -256), 255));
    _write16(iram12, k, min(max(b2l, -256), 255));
    _write16(iram13, k, min(max(b2h, -256), 255));
    _write16(iram14, k, min(max(b3l, -256), 255));
    _write16(iram15, k, min(max(b3h, -256), 255));
  }
}



5.9.6 Performance Evaluation 

To guarantee fair conditions for this example, we have to compare the .total amounts of cycles the 

XP^ScZ^T °A a Jf d am ° Unt ° f ^ ° nce 0n *e reference sy^ ^ once on t 
RNT Comb,natlon - As determining cycle times of single configurations for execution on me 

^x^ r :^ c bad resuIts for execution ° n the reference ■ - ™: 

Data transfer times 
As the outputs of the first two configurations are consumed by the subsequent configurations, they are
never written back to RAM. Only the output created by _XppCfg_idctreorder has to be written back to RAM.



Final performance results for the first Iteration 



configurations ! 
Idctrow " 


Patsr 


iD&aclie 
' 32 


iU248 








iCorej Cache}{@$ft 


idctcol 


u 


32 


1UB4U 


1461 
1513 


»PQU 


1513&#£&j$ 
714^5b|t) 






Iddreorder 

"all configurations 


u 

~8SU 


32 

ye 


bU4U 
2bW\6 


714 
WW 



Final performance results for the subsequent iterations 
5.10 Wavelet



5.10.1 Original Code 

#define BLOCK_SIZE 16
#define COL 64
#define ROW 1

void forward_wavelet()
{
  int i, nt, *dmid;
  int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
  int mid, ii;
  int *x;
  int s[256], d[256];

  for (nt=COL; nt >= BLOCK_SIZE; nt>>=1) {

    for (i=0; i < nt*COL; i+=COL) { /* column loop nest */
      x = &int_data[i];
      mid = (nt >> 1) - 1;
      s[0] = x[0];
      d[0] = x[ROW];
      s[1] = x[2];
      s[mid] = x[2*mid];
      d[mid] = x[2*mid+ROW];

      d[0] = (d[0] << 1) - s[0] - s[1];
      s[0] = s[0] + (d[0] >> 2);

      d_tmp0 = d[0];
      s_tmp0 = s[1];

      for (ii=1; ii < mid; ii++) {
        s_tmp1 = x[2*ii+2];
        d_tmp1 = ((x[2*ii+ROW]) << 1) - s_tmp0 - s_tmp1;
        d[ii] = d_tmp1;
        s[ii] = s_tmp0 + ((d_tmp0 + d_tmp1) >> 3);
        d_tmp0 = d_tmp1;
        s_tmp0 = s_tmp1;
      }

      d[mid] = (d[mid] - s[mid]) << 1;
      s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

      for (ii=0; ii <= mid; ii++) {
        x[ii] = s[ii];
        x[ii+mid+1] = d[ii];
      }
    }

    for (i=0; i < nt; i++) { /* row loop nest */
      x = &int_data[i];
      mid = (nt >> 1) - 1;
      s[0] = x[0];
      d[0] = x[COL];
      s[1] = x[COL<<1];
      s[mid] = x[(COL<<1)*mid];
      d[mid] = x[(COL<<1)*mid+COL];

      d[0] = (d[0] << 1) - s[0] - s[1];
      s[0] = s[0] + (d[0] >> 2);

      d_tmp0 = d[0];
      s_tmp0 = s[1];

      for (ii=1; ii < mid; ii++) {
        s_tmp1 = x[2*COL*(ii+1)];
        d_tmp1 = (x[2*COL*ii+COL] << 1) - s_tmp0 - s_tmp1;
        d[ii] = d_tmp1;
        s[ii] = s_tmp0 + ((d_tmp0 + d_tmp1) >> 3);
        d_tmp0 = d_tmp1;
        s_tmp0 = s_tmp1;
      }

      d[mid] = (d[mid] << 1) - (s[mid] << 1);
      s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

      for (ii=0; ii <= mid; ii++) {
        x[ii*COL] = s[ii];
        x[(ii+mid+1)*COL] = d[ii];
      }
    }
  }
}


The source code exhibits an outermost loop with induction variable nt that encloses two loop nests, the
column loop nest and the row loop nest, with four innermost loops in total. The compiler can determine
by value range analysis that nt will take the values 64, 32 and 16.

5.10.2 Optimizing the Column Loop Nest 

After specialization for nt = 64 and dead code elimination, the column loop nest takes the following
form. For readability reasons we have replaced the variable names s_tmp0, d_tmp0 and ii by s0, d0 and
the more common index j.



for (i=0; i < 64*64; i+=64) {
  x = &int_data[i];

  s[0] = x[0];
  d[0] = x[1];
  s[1] = x[2];
  s[31] = x[62];
  d[31] = x[63];

  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);

  d0 = d[0];
  s0 = s[1];

  for (j=1; j < 31; j++) {
    d[j] = ((x[2*j+1]) << 1) - s0 - x[2*j+2];
    s[j] = s0 + ((d0 + d[j]) >> 3);
    d0 = d[j];
    s0 = x[2*j+2];
  }

  d[31] = (d[31] - s[31]) << 1;
  s[31] = s[31] + ((d[30] + d[31]) >> 3);

  for (j=0; j <= 31; j++) {
    x[j] = s[j];
    x[j+32] = d[j];
  }
}




Figure 3 Dataflow graph of the innermost loop nest 



From the dataflow graph of the first innermost loop nest (induction variable j) the compiler computes
an optimization table. In this stage of optimization it just counts computations and neglects the
secondary effort necessary for IRAM address generation and signal merging. If there are different
possibilities to perform an operation on the XPP in this initial stage, the compiler schedules the ALU
with highest priority. Inputs from or outputs to arrays with address differences of less than 128 words
(IRAM size) are always counted as coming from the same IRAM. Hence the first innermost loop
needs three input IRAMs (s0, d0, and one shared by x[2*j+1] and x[2*j+2]) and two output IRAMs
(s[j] and d[j]). The second innermost loop needs two input IRAMs (s[j] and d[j]) and one output
IRAM (x[j] and x[j+32]).



Parameter               Value
Vector length
Reused data set size
I/O IRAMs               3 I + 2 O
ALU                     5
BREG                    1 (shift right by three)
FREG                    0
Dataflow graph width    2
Dataflow graph height   6
Configuration cycles    5*30 + 2
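
The IRAM counting rule used for this table (accesses whose addresses differ by less than 128 words are charged to the same IRAM) can be illustrated with a small sketch. It is not the compiler's actual algorithm: the function name `count_irams` and the greedy windowing over sorted addresses are assumptions made for illustration only.

```c
#include <stdlib.h>

#define IRAM_WORDS 128  /* IRAM size in words, as stated in the text */

/* Comparison callback for qsort: ascending order of int addresses. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: counts how many IRAMs a set of word addresses
 * needs. The addresses are sorted in place; a new IRAM is opened
 * whenever an address lies IRAM_WORDS or more beyond the first address
 * of the current IRAM window. */
int count_irams(int *addr, int n)
{
    if (n <= 0)
        return 0;
    qsort(addr, (size_t)n, sizeof addr[0], cmp_int);

    int irams = 1;
    int base = addr[0];
    for (int i = 1; i < n; i++) {
        if (addr[i] - base >= IRAM_WORDS) {
            irams++;
            base = addr[i];
        }
    }
    return irams;
}
```

With this rule, two accesses such as x[2*j+1] and x[2*j+2] always fall into one window, which is why they are counted as a single input IRAM.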



nearly the same iteration count as the first «Ts^«££i£?Sl tn . 'nnermost loop has 

curing neranon j-i ot the second innermost bop for instance */??7 nf i 
overwritten, while during iterations; 6 of the ™«*.xlJJJ of the original x array is 

available. The cache JS^^ t^m^^^ on ^\^ of ^ »«* be 
problem. One cache memory^^ to this 

writing. As the IRAM filling from the cache is friSered Tb?S^ T ™? mg L ^ ° ne for 
DRAM is filled once before the config^ratKn is exS the read -° nl y 

for (i=0; i < 64*64; i+=64) {
  int t[64]; // Temporary array built by output IRAM
  x = &int_data[i];

  s[0] = x[0];
  d[0] = x[1];
  s[1] = x[2];
  s[31] = x[62];
  d[31] = x[63];

  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);

  d0 = d[0];
  s0 = s[1];

  for (j=1; j < 31; j++) {
    d[j] = (x[2*j+1] << 1) - s0 - x[2*j+2];
    s[j] = s0 + ((d0 + d[j]) >> 3);
    d0 = d[j];
    s0 = s[j];
    t[j] = s[j];
    t[j+32] = d[j];
  }

  // The following array copy code is implicitly
  // done by the cache controller.
  for (j=1; j < 31; j++) {
    x[j] = t[j];
    x[j+32] = t[j+32];
  }

  d[31] = (d[31] - s[31]) << 1;
  s[31] = s[31] + ((d[30] + d[31]) >> 3);
  x[0] = s[0];
  x[32] = d[0];
  x[31] = s[31];
  x[63] = d[31];
}



Next the compiler tries to reduce IRAM usage. Data dependence analysis shows that the values of
array s which are manipulated within the innermost loop are not used outside of the loop. d[30] is the
only value of array d calculated within the innermost loop that is used outside of it. Thus the
compiler replaces d[30] by t[62] outside of the loop. Now it is legal that array contraction replaces
arrays s and d within the loop by scalars s1 and d1. A further IRAM reduction is done by using a
common IRAM for the input scalars s0 and d0 (array sd). The tradeoff for this IRAM saving is a
minor extra effort for the distribution of the two values to their dedicated PAE locations on the XPP.
We arrive at:



for (i=0; i < 64*64; i+=64) {
  int t[64]; // Temporary array built by output IRAM
  x = &int_data[i];

  s[0] = x[0];
  d[0] = x[1];
  s[1] = x[2];
  s[31] = x[62];
  d[31] = x[63];

  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);

  d0 = d[0];
  s0 = s[1];

  // The following loop is executed on the XPP.
  for (j=1; j < 31; j++) {
    d1 = ((x[2*j+1]) << 1) - s0 - x[2*j+2];
    s1 = s0 + ((d0 + d1) >> 3);
    d0 = d1;
    s0 = s1;
    t[j] = s1;
    t[j+32] = d1;
  }

  // The following array copy code is implicitly
  // done by the cache controller.
  for (j=1; j < 31; j++) {
    x[j] = t[j];
    x[j+32] = t[j+32];
  }

  d[31] = (d[31] - s[31]) << 1;
  s[31] = s[31] + ((t[62] + d[31]) >> 3);
  x[0] = s[0];
  x[32] = d[0];
  x[31] = s[31];
  x[63] = d[31];
}



with an optimization table

Parameter               Value
Vector length
Reused data set size
I/O IRAMs               2 I + 1 O
ALU                     5
BREG                    1
FREG                    0
Dataflow graph width    2
Dataflow graph height   6
Configuration cycles    5*30 + 2



The innermost loop does not exploit the XPP to capacity. So the compiler tries to unroll the innermost
loop. For the computation of the unrolling degree it is necessary to have a more detailed estimate,
i.e. the compiler estimates the address computation network for the
IRAMs. Array x must provide two successive array elements within each loop iteration. This is done
by a counter starting with address 3 and closing with address 62 (1 FREG, 1 BREG). The
IRAM data is then distributed to two different data paths by a demultiplexer (1 FREG) which toggles
with every incoming data packet between the two output lines (1 FREG, 1 BREG). The same
toggle network is necessary for the array sd. A merger (1 FREG, 1 BREG) is used
to fetch the first data packet from s0 and all others from s1. A second one merges d0 and d1. Finally

S^STS £f G ' 2 « ^ addr6SSeS ' ^ first starti »S wi * address and 

the second with address 33. The resulting data as well as the addresses are crossed by mergers which
toggle between the two incoming packet streams (4 FREG, 2 BREG). This results in the following
optimization table:

Parameter               Value
Vector length           30
Reused data set size    -
I/O IRAMs               2 I + 1 O
ALU                     5
BREG                    10
FREG                    13
Dataflow graph width    2
Dataflow graph height   6
Configuration cycles    5*30 + 2



The compiler computes an unrolling degree from the maximum number of FREGs and the number of
FREGs per innermost loop (13). The IRAM use per innermost loop is 3, compared to 16 available
IRAMs. From this, the compiler computes an unrolling degree equal to 5 (= 16/3, rounded down).
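
The resource-bound reasoning above can be sketched as a minimum over per-resource limits. This is an illustration only, not the compiler's implementation: the type `Resource` and the function `unroll_degree` are hypothetical names, and the figure of 64 available FREGs used in the note below is an assumption inferred from the utilization table later in this section (42 FREGs reported as 66%). The 13 FREGs and 3 IRAMs per loop and the 16 available IRAMs are taken from the text.

```c
/* One resource class: how many units exist and how many one copy of the
 * innermost loop consumes. Both names are hypothetical. */
typedef struct {
    int available;
    int per_loop;
} Resource;

/* Returns the largest number of loop copies that fits into every
 * resource class, i.e. the minimum of available / per_loop over all
 * classes that the loop actually uses. Returns 0 if no class is used. */
int unroll_degree(const Resource *res, int n)
{
    int degree = -1;
    for (int i = 0; i < n; i++) {
        if (res[i].per_loop <= 0)
            continue;                 /* resource not used by the loop */
        int fit = res[i].available / res[i].per_loop;
        if (degree < 0 || fit < degree)
            degree = fit;
    }
    return degree < 0 ? 0 : degree;
}
```

Under the assumed figures, the FREG bound (64/13) allows four copies and the IRAM bound (16/3) allows five, so the sketch yields four, which agrees with the four parallel innermost loops of the configuration shown next.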



/** _XppCfg_wavelet64()
 * Performs four innermost loops of the wavelet transformation
 * in parallel.
 *
 * XPPIN:  iram0  - s0_0, d0_0
 *         iram1  - 64 integers of the x array of iteration i
 *         iram2  - s0_64, d0_64
 *         iram3  - 64 integers of the x array of iteration i+64
 *         iram4  - s0_128, d0_128
 *         iram5  - 64 integers of the x array of iteration i+128
 *         iram6  - s0_192, d0_192
 *         iram7  - 64 integers of the x array of iteration i+192
 *
 * XPPOUT: iram9  - 64 integers of the x array of iteration i
 *         iram11 - 64 integers of the x array of iteration i+64
 *         iram13 - 64 integers of the x array of iteration i+128
 *         iram15 - 64 integers of the x array of iteration i+192
 */
void _XppCfg_wavelet64()
{
  int iram0[128], iram2[128], iram4[128], iram6[128];
  int iram1[128], iram3[128], iram5[128], iram7[128];
  int iram9[128], iram11[128], iram13[128], iram15[128];

  int tmp_d0_0 = iram0[0];
  int tmp_s0_0 = iram0[1];

  int tmp_d0_64 = iram2[0];
  int tmp_s0_64 = iram2[1];

  int tmp_d0_128 = iram4[0];
  int tmp_s0_128 = iram4[1];

  int tmp_d0_192 = iram6[0];
  int tmp_s0_192 = iram6[1];

  for (j=1; j < 31; j++) {
    int tmp_d1_0, tmp_d1_64, tmp_d1_128, tmp_d1_192;
    int tmp_s1_0, tmp_s1_64, tmp_s1_128, tmp_s1_192;

    tmp_d1_0 = ((iram1[2*j+1]) << 1) - tmp_s0_0 - iram1[2*j+2];
    tmp_s1_0 = ((tmp_d0_0 + tmp_d1_0) >> 3) + tmp_s0_0;
    iram9[j] = tmp_s0_0 = tmp_s1_0;
    iram9[j+32] = tmp_d0_0 = tmp_d1_0;

    tmp_d1_64 = ((iram3[2*j+1]) << 1) - tmp_s0_64 - iram3[2*j+2];
    tmp_s1_64 = ((tmp_d0_64 + tmp_d1_64) >> 3) + tmp_s0_64;
    iram11[j] = tmp_s0_64 = tmp_s1_64;
    iram11[j+32] = tmp_d0_64 = tmp_d1_64;

    tmp_d1_128 = ((iram5[2*j+1]) << 1) - tmp_s0_128 - iram5[2*j+2];
    tmp_s1_128 = ((tmp_d0_128 + tmp_d1_128) >> 3) + tmp_s0_128;
    iram13[j] = tmp_s0_128 = tmp_s1_128;
    iram13[j+32] = tmp_d0_128 = tmp_d1_128;

    tmp_d1_192 = ((iram7[2*j+1]) << 1) - tmp_s0_192 - iram7[2*j+2];
    tmp_s1_192 = ((tmp_d0_192 + tmp_d1_192) >> 3) + tmp_s0_192;
    iram15[j] = tmp_s0_192 = tmp_s1_192;
    iram15[j+32] = tmp_d0_192 = tmp_d1_192;
  }
}

Two similar configurations handle the cases where nt = 32 and nt = 16. They are not shown here as 
they differ only in the number of loop iterations (15, and 7, respectively). 
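
Since the three configurations differ only in their loop bound, the kernel of one of the four parallel loops can be sketched as a single function with that bound as a parameter. This is an illustration under stated assumptions, not generated configuration code: the name `wavelet_inner`, its parameter list and the output offset `mid + 1` (32 for nt = 64, following the column loop nest of Section 5.10.1) are choices made for this sketch.

```c
/* Sketch of the innermost loop of one wavelet lane.
 *   x    input samples (x[2*j+1] and x[2*j+2] are read)
 *   out  output buffer: s values at out[j], d values at out[j+mid+1]
 *   d0   initial d value, s0 initial s value (the "sd" pair)
 *   mid  loop bound, (nt >> 1) - 1, e.g. 31 for nt = 64
 * Note: >> on negative ints is implementation-defined in C; the sketch
 * assumes an arithmetic shift, as on common compilers. */
void wavelet_inner(const int *x, int *out, int d0, int s0, int mid)
{
    for (int j = 1; j < mid; j++) {
        int d1 = (x[2 * j + 1] << 1) - s0 - x[2 * j + 2];
        int s1 = s0 + ((d0 + d1) >> 3);
        out[j] = s1;
        out[j + mid + 1] = d1;
        d0 = d1;   /* loop-carried values: this recurrence forms the */
        s0 = s1;   /* cycle in the dataflow graph                     */
    }
}
```

Calling it with mid = 31, 15 or 7 corresponds to the nt = 64, 32 and 16 cases, respectively.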

At this point some remarks about the further translation of the configuration code to NML code are
in order. The configuration code contains the operational elements and the connections defined by
the dataflow graph, but it is incomplete. It neither specifies which element to place in which cell
of the XPP array (placement and routing), nor does it allow an ad hoc decision which operation to
execute in which computational unit. It is, for instance, possible to perform a subtraction in an ALU
or in a BREG. These decisions are very delicate, as they highly influence the performance of the
generated XPP code. In the current example the following strategy is applied. The first thing to notice
is the cycle in the dataflow graph. It defines a critical path as it decides how many XPP cycles are at
least necessary to provide a new output value. Counting along the dataflow cycle we find five
operational elements: merge, subtract, add1, shift right by 3, and add2. The worst case assumption
is that every operational element takes one XPP cycle. This explains the 5*30+2 configuration cycles.

The XPP, however, provides BREG elements which can be used without a register delay (NOREG
property). One operation of the cycle is the shift right by 3. This operation can be done in a BREG.
Both remaining additions are chosen as ALU operations (2 XPP cycles). The subtraction is done in a
BREG with NOREG property (0 XPP cycles) and the merge is only possible as FREG (1 XPP cycle).
Hence we obtain a minimum of three XPP cycles per output value. This requires that all operational
elements of the cycle are placed within one line of the XPP array, and within a bus section free of
switch objects of the horizontal XPP buses. Hence the compiler must definitely choose the placement
of this critical code section, otherwise a severe deterioration of the performance is inevitable.
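
The cycle accounting in this paragraph can be checked with a small sketch. It is only a worked calculation under assumptions: the function names are hypothetical, the constant 2 in 5*30+2 is treated as a fixed overhead, and the shift is assumed to contribute no cycle in the optimized mapping, since the text states a total of three cycles.

```c
/* Sum of the per-element latencies (in XPP cycles) along the cycle of
 * the dataflow graph. */
int critical_path(const int *latency, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += latency[i];
    return sum;
}

/* Configuration cycles for a loop whose critical path takes
 * cycles_per_value cycles per output value, plus a fixed overhead. */
int config_cycles(int cycles_per_value, int iterations, int overhead)
{
    return cycles_per_value * iterations + overhead;
}
```

With one cycle per element (merge, subtract, add1, shift, add2) the path length is 5 and 30 iterations give 152 cycles; with latencies 1, 0, 1, 0, 1 the path length is 3 and the same loop takes 92 cycles.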



5.10.3 Optimizing the Row Loop Nest

The optimization of the row loop nest starts along the same lines as the column loop nest. After
pre-processing, application of copy propagation followed by dead code elimination, and substitution
of the constants for nt (64) and mid (31), the compiler peels off the first and last iteration of the
second innermost loop and moves the code between the two innermost loops after the second one:

for (i=0; i < 64; i++) {
  x = &int_data[i];

  s[0] = x[0];
  d[0] = x[64];
  s[1] = x[64*2];
  s[31] = x[64*62];
  d[31] = x[64*63];

  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);

  d0 = d[0];
  s0 = s[1];

  for (j=1; j < 31; j++) {
    d[j] = ((x[64*(2*j+1)]) << 1) - s0 - x[64*(2*j+2)];
    s[j] = s0 + ((d0 + d[j]) >> 3);
    d0 = d[j];
    s0 = s[j];
  }

  for (j=1; j < 31; j++) {
    x[64*j] = s[j];
    x[64*(j+32)] = d[j];
  }

  d[31] = (d[31] << 1) - (s[31] << 1);
  s[31] = s[31] + ((d[30] + d[31]) >> 3);
  x[0] = s[0];
  x[64*32] = d[0];
  x[64*31] = s[31];
  x[64*63] = d[31];
}



£S iTrK~ p r S t ss^tr for the ** «— 

innermost loop. Hence the compiler reordeTthTZ ? f mem0,y Ae &st itera ti°" of the 
A similar problem ^JTS^iSJSS^i ' T ^ f,rst k"*™"* 'oop 

The new arrays surfers from mTsaTe ^nTS ^-f *" °° mpiler also Educes arrayv 

loop fusion panting S^S^^^^ * *o previous section. Z 

guarantees correctness of the tLsformed sou^code a*™***" of a temporary array/ which 



for (i=0; i < 64; i++) {
  int y[64], t[64];
  x = &int_data[i];

  s[0] = x[0];
  d[0] = x[64];
  s[1] = x[64*2];
  s[31] = x[64*62];
  d[31] = x[64*63];

  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);

  d0 = d[0];
  s0 = s[1];

  // Column to row transfer.
  for (j=1; j < 31; j++) {
    y[2*j+1] = x[64*(2*j+1)];
    y[2*j+2] = x[64*(2*j+2)];
  }

  // The following loop is executed on the XPP.
  for (j=1; j < 31; j++) {
    d[j] = ((y[2*j+1]) << 1) - s0 - y[2*j+2];
    s[j] = s0 + ((d0 + d[j]) >> 3);
    d0 = d[j];
    s0 = s[j];
    t[j] = s[j];
    t[j+32] = d[j];
  }

  // The following array copy code is implicitly
  // done by the cache controller.
  for (j=1; j < 31; j++) {
    y[j] = t[j];
    y[j+32] = t[j+32];
  }

  // Row to column transfer.
  for (j=1; j < 31; j++) {
    x[64*j] = y[j];
    x[64*(j+32)] = y[j+32];
  }

  d[31] = (d[31] << 1) - (s[31] << 1);
  s[31] = s[31] + ((x[64*62] + d[31]) >> 3);
  x[0] = s[0];
  x[64*32] = d[0];
  x[64*31] = s[31];
  x[64*63] = d[31];
}



After loop fusion the second innermost loop looks exactly like the loop handled in the previous section
and can thus use the same XPP configuration. The two surrounding reordering loops actually perform
a transposition of a column vector to a row vector and are most efficiently executed on the RISC.
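
The two reordering loops amount to a strided gather before the configuration runs and a strided scatter afterwards. A minimal sketch of this idea follows; the function names `gather_strided` and `scatter_strided` are hypothetical and do not appear in the generated code.

```c
/* Copies n elements that lie `stride` words apart in x (one matrix
 * column in row-major storage) into the contiguous buffer y. */
void gather_strided(const int *x, int stride, int *y, int n)
{
    for (int j = 0; j < n; j++)
        y[j] = x[stride * j];
}

/* Inverse operation: writes the contiguous buffer y back to the
 * strided positions of x. */
void scatter_strided(const int *y, int *x, int stride, int n)
{
    for (int j = 0; j < n; j++)
        x[stride * j] = y[j];
}
```

For the 64x64 example, a call such as gather_strided(&int_data[i], 64, y, 64) would place column i contiguously, so that it fits the IRAM layout expected by the column configuration.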

5.10.4 Final Code 

The outermost loop is completely unrolled, which produces six inner loop nests (induction variable i).
Each of these inner loops is unrolled four times with the wavelet XPP configuration in the center. The
unrolling of the inner loops requires a bundle of new local variables whose names are suffixed by the
original iteration numbers. Array variables with constant array indices are replaced by scalar variables
for readability reasons. s[0], for instance, becomes s0_0, s0_64, s0_128, s0_192.

One further loop transformation is necessary to facilitate the work of the cache controller. When the
wavelet configuration finishes, a computation result in array x of each iteration i is used in the
code that follows. Hence an XppSync operation is necessary after each XppExecute, which forces
a write-back of the IRAM contents to the first level cache. The RISC must wait until the write-back
finishes. However, if the compiler splits the loop after XppExecute, it is possible to keep the RISC
busy with the preparation of further configurations while the cache controller performs the
write-back operation (pipelining effect). The cost for the loop distribution is the expansion of some
scalar variables, i.e. all scalar variables defined before and used after XppExecute must be expanded
to array variables. Hence variable s0_0, for instance, becomes s0_0[16].

Loop distribution is applicable to both the column and the row loop nest. However, in the case
of the row loop nest this requires an array for each vector element of y, i.e. each vector becomes a
matrix. In order to reduce the memory demand the compiler does not perform a complete loop
distribution but rather executes the two loops shifted by a memory requirement factor. This loop
optimization is called shifted loop merging (or shifted loop fusion) [7]. The memory requirement
factor is chosen to a value of four, as the architecture provides three IRAM shadows.
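
A generic sketch of shifted loop merging with a shift of four is given below. It is not the code produced by the compiler: the function `shifted_merge`, the two placeholder stages (doubling, then adding one) and the decision to run the delayed stage before the producing stage within one iteration are assumptions made so that a four-entry buffer is sufficient in this example.

```c
#define SHIFT 4  /* memory requirement factor from the text */

/* Two loop bodies merged into one loop, the second one delayed by SHIFT
 * iterations. Scalars that cross the two bodies are expanded to arrays
 * of SHIFT elements and indexed modulo SHIFT. */
void shifted_merge(const int *in, int *out, int n)
{
    int ring[SHIFT];

    for (int i = 0; i < n + SHIFT; i++) {
        if (i >= SHIFT) {
            /* delayed second body: consumes the value produced
             * SHIFT iterations earlier */
            out[i - SHIFT] = ring[(i - SHIFT) % SHIFT] + 1;
        }
        if (i < n) {
            /* first body: produces a value for a later iteration */
            ring[i % SHIFT] = in[i] * 2;
        }
    }
}
```

The merged loop runs n + SHIFT times, which mirrors the 64+16 bound of the row loop in the listing of this section.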

Because of the voluminous loop unrolling, we present the optimized RISC code for nt = 64 only.

void forward_wavelet()
{
  int i, j, k;

  int s0_0[4], s31_0[4], s1_0;
  int s0_64[4], s31_64[4], s1_64;
  int s0_128[4], s31_128[4], s1_128;
  int s0_192[4], s31_192[4], s1_192;
  int d0_0[4], d31_0[4];
  int d0_64[4], d31_64[4];
  int d0_128[4], d31_128[4];
  int d0_192[4], d31_192[4];

  int sd_0[2], sd_64[2], sd_128[2], sd_192[2];

  int y_0[64][4], y_64[64][4], y_128[64][4], y_192[64][4];

  for (i=0; i < 16*256; i+=256) { /* nt=64, column loop */

    if (i < 16*256) { /* XppPreload and XppExecute */

      XppPreloadConfig(_XppCfg_wavelet64);

      k = (i / 256) % 4;

      x = &int_data[i];
      s0_0[k] = x[0];
      d0_0[k] = x[1];
      s1_0 = x[2];
      s31_0[k] = x[62];
      d31_0[k] = x[63];
      sd_0[0] = d0_0[k] = (d0_0[k] << 1) - s0_0[k] - s1_0;
      sd_0[1] = s0_0[k] = (d0_0[k] >> 2) + s0_0[k];
      XppPreload(0, sd_0, 2);
      XppPreload(1, x, 64);
      XppPreloadClean(9, x, 64);

      x = &int_data[i+64];
      s0_64[k] = x[0];
      d0_64[k] = x[1];
      s1_64 = x[2];
      s31_64[k] = x[62];
      d31_64[k] = x[63];
      sd_64[0] = d0_64[k] = (d0_64[k] << 1) - s0_64[k] - s1_64;
      sd_64[1] = s0_64[k] = (d0_64[k] >> 2) + s0_64[k];
      XppPreload(2, sd_64, 2);
      XppPreload(3, x, 64);
      XppPreloadClean(11, x, 64);

      x = &int_data[i+128];
      s0_128[k] = x[0];
      d0_128[k] = x[1];
      s1_128 = x[2];
      s31_128[k] = x[62];
      d31_128[k] = x[63];
      sd_128[0] = d0_128[k] = (d0_128[k] << 1) - s0_128[k] - s1_128;
      sd_128[1] = s0_128[k] = (d0_128[k] >> 2) + s0_128[k];
      XppPreload(4, sd_128, 2);
      XppPreload(5, x, 64);
      XppPreloadClean(13, x, 64);

      x = &int_data[i+192];
      s0_192[k] = x[0];
      d0_192[k] = x[1];
      s1_192 = x[2];
      s31_192[k] = x[62];
      d31_192[k] = x[63];
      sd_192[0] = d0_192[k] = (d0_192[k] << 1) - s0_192[k] - s1_192;
      sd_192[1] = s0_192[k] = (d0_192[k] >> 2) + s0_192[k];
      XppPreload(6, sd_192, 2);
      XppPreload(7, x, 64);
      XppPreloadClean(15, x, 64);

      XppExecute();
    } /* i < 16*256 */

    if (i >= 4*256) { /* delayed XppSync */

      k = (i - 4*256) % 4;

      x = &int_data[i-4*256];
      XppSync(x, 64);
      d31_0[k] = (d31_0[k] - s31_0[k]) << 1;
      s31_0[k] = s31_0[k] + ((x[62] + d31_0[k]) >> 3);
      x[0] = s0_0[k];
      x[32] = d0_0[k];
      x[31] = s31_0[k];
      x[63] = d31_0[k];

      x = &int_data[i-4*256+64];
      XppSync(x, 64);
      d31_64[k] = (d31_64[k] - s31_64[k]) << 1;
      s31_64[k] = s31_64[k] + ((x[62] + d31_64[k]) >> 3);
      x[0] = s0_64[k];
      x[32] = d0_64[k];
      x[31] = s31_64[k];
      x[63] = d31_64[k];

      x = &int_data[i-4*256+128];
      XppSync(x, 64);
      d31_128[k] = (d31_128[k] - s31_128[k]) << 1;
      s31_128[k] = s31_128[k] + ((x[62] + d31_128[k]) >> 3);
      x[0] = s0_128[k];
      x[32] = d0_128[k];
      x[31] = s31_128[k];
      x[63] = d31_128[k];

      x = &int_data[i-4*256+192];
      XppSync(x, 64);
      d31_192[k] = (d31_192[k] - s31_192[k]) << 1;
      s31_192[k] = s31_192[k] + ((x[62] + d31_192[k]) >> 3);
      x[0] = s0_192[k];
      x[32] = d0_192[k];
      x[31] = s31_192[k];
      x[63] = d31_192[k];
    } /* i >= 4*256 */
  }

  for (i=0; i < 64+16; i+=4) { /* nt=64, row loop */

    if (i < 64) { /* XppPreload and XppExecute */

      XppPreloadConfig(_XppCfg_wavelet64);

      k = (i / 4) % 4;

      x = &int_data[i];
      s0_0[k] = x[0];
      d0_0[k] = x[64];
      s1_0 = x[128];
      s31_0[k] = x[3968];
      d31_0[k] = x[4032];
      sd_0[0] = d0_0[k] = (d0_0[k] << 1) - s0_0[k] - s1_0;
      sd_0[1] = s0_0[k] = (d0_0[k] >> 2) + s0_0[k];
      for (j=1; j < 31; j++) {
        y_0[2*j+1][k] = x[64+128*j];
        y_0[2*j+2][k] = x[128+128*j];
      }
      XppPreload(0, sd_0, 2);
      XppPreload(1, y_0[k], 64);
      XppPreloadClean(9, y_0[k], 64);

      x = &int_data[i+1];
      s0_64[k] = x[0];
      d0_64[k] = x[64];
      s1_64 = x[128];
      s31_64[k] = x[3968];
      d31_64[k] = x[4032];
      sd_64[0] = d0_64[k] = (d0_64[k] << 1) - s0_64[k] - s1_64;
      sd_64[1] = s0_64[k] = (d0_64[k] >> 2) + s0_64[k];
      for (j=1; j < 31; j++) {
        y_64[2*j+1][k] = x[64+128*j];
        y_64[2*j+2][k] = x[128+128*j];
      }
      XppPreload(2, sd_64, 2);
      XppPreload(3, y_64[k], 64);
      XppPreloadClean(11, y_64[k], 64);

      x = &int_data[i+2];
      s0_128[k] = x[0];
      d0_128[k] = x[64];
      s1_128 = x[128];
      s31_128[k] = x[3968];
      d31_128[k] = x[4032];
      sd_128[0] = d0_128[k] = (d0_128[k] << 1) - s0_128[k] - s1_128;
      sd_128[1] = s0_128[k] = (d0_128[k] >> 2) + s0_128[k];
      for (j=1; j < 31; j++) {
        y_128[2*j+1][k] = x[64+128*j];
        y_128[2*j+2][k] = x[128+128*j];
      }
      XppPreload(4, sd_128, 2);
      XppPreload(5, y_128[k], 64);
      XppPreloadClean(13, y_128[k], 64);

      x = &int_data[i+3];
      s0_192[k] = x[0];
      d0_192[k] = x[64];
      s1_192 = x[128];
      s31_192[k] = x[3968];
      d31_192[k] = x[4032];
      sd_192[0] = d0_192[k] = (d0_192[k] << 1) - s0_192[k] - s1_192;
      sd_192[1] = s0_192[k] = (d0_192[k] >> 2) + s0_192[k];
      for (j=1; j < 31; j++) {
        y_192[2*j+1][k] = x[64+128*j];
        y_192[2*j+2][k] = x[128+128*j];
      }
      XppPreload(6, sd_192, 2);
      XppPreload(7, y_192[k], 64);
      XppPreloadClean(15, y_192[k], 64);

      XppExecute();
    } /* i < 64 */

    if (i >= 16) { /* delayed XppSync */

      k = (i - 16) % 4;

      x = &int_data[i-16];
      XppSync(y_0[k], 64);
      for (j=1; j < 31; j++) {
        x[64*j] = y_0[j][k];
        x[2048+64*j] = y_0[j+32][k];
      }
      d31_0[k] = (d31_0[k] << 1) - (s31_0[k] << 1);
      s31_0[k] = s31_0[k] + ((x[3968] + d31_0[k]) >> 3);
      x[0] = s0_0[k];
      x[2048] = d0_0[k];
      x[1984] = s31_0[k];
      x[4032] = d31_0[k];

      x = &int_data[i-16+1];
      XppSync(y_64[k], 64);
      for (j=1; j < 31; j++) {
        x[64*j] = y_64[j][k];
        x[2048+64*j] = y_64[j+32][k];
      }
      d31_64[k] = (d31_64[k] << 1) - (s31_64[k] << 1);
      s31_64[k] = s31_64[k] + ((x[3968] + d31_64[k]) >> 3);
      x[0] = s0_64[k];
      x[2048] = d0_64[k];
      x[1984] = s31_64[k];
      x[4032] = d31_64[k];

      x = &int_data[i-16+2];
      XppSync(y_128[k], 64);
      for (j=1; j < 31; j++) {
        x[64*j] = y_128[j][k];
        x[2048+64*j] = y_128[j+32][k];
      }
      d31_128[k] = (d31_128[k] << 1) - (s31_128[k] << 1);
      s31_128[k] = s31_128[k] + ((x[3968] + d31_128[k]) >> 3);
      x[0] = s0_128[k];
      x[2048] = d0_128[k];
      x[1984] = s31_128[k];
      x[4032] = d31_128[k];

      x = &int_data[i-16+3];
      XppSync(y_192[k], 64);
      for (j=1; j < 31; j++) {
        x[64*j] = y_192[j][k];
        x[2048+64*j] = y_192[j+32][k];
      }
      d31_192[k] = (d31_192[k] << 1) - (s31_192[k] << 1);
      s31_192[k] = s31_192[k] + ((x[3968] + d31_192[k]) >> 3);
      x[0] = s0_192[k];
      x[2048] = d0_192[k];
      x[1984] = s31_192[k];
      x[4032] = d31_192[k];
    } /* i >= 16 */
  }

  /* nt=32, column loop */

  /* nt=32, row loop */

  /* nt=16, column loop */

  /* nt=16, row loop */
}



5.10.5 Performance Evaluation 

The performance evaluation of this example is based on the assumption that the code optimizations
done for the XPP are also useful for the reference processor. Hence we compare the code executed
within each configuration only. But this argumentation is not entirely correct for the current example.
The compiler introduced a column to row transposition (and vice versa) for the row loop nest because
of the restricted IRAM size. This optimization is not meaningful for the reference processor. This is
why we correct the reference system performance values by subtracting the cycles necessary for the
transposition.

The data transfer performance for the configuration _XppCfg_wavelet64 as part of the column loop
nest is listed in the following table. It is assumed that there is no data in the cache (startup case).
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Ie e ^:Sed f b^ -ay sector is already in 

cycles fo/wn te dliS^TSSS T^^T^^T*^ ^ ° 0t bclude 

-JfrpCfg_ W aveIet64 configuration ^given * *" 6XeCUtion of *» who,e 

JESSES J^Et lUZ in SKr/?^ ^ ^ miss 
accesses to the -ayl^as the^ J« P-duced by 

*if_«fata is loaded into the first level cache Th^ folio ■ £ iterations the whole array 

for the remaining 15 fta^J2j3£^ ™n«ta. the data transfer cycles 




"a^ Part of the colons loop access 

summarize the datTl Se Ses Tor hT T T ^ ^ 3t alL ^ foliowi «8 teb '« 
configurations as part ofme ^ ? P ^ _»^o 




The data transfer performance for configuration _XppCfg_wavelet64 as part of the row loop nest is
listed in the following table (startup case).













































^ t ^ t 3STS , !r? tab 5 S PreSentthe ° Veral1 reSUltS " 7116 flrst tebIe shows * e ^ase 
the same memory limita Ji om „ R§c h X^ S t OPPI***™ suffer „,th the XPP from 






configurations 




^Ft^tca^ 


•pig Cache ! feRi$ 


RefiS^errW? 
'Cache^Fi^ 




Cache', RAM 


wavelet64 (column nest) 
wavelet64 (row nest) 
wavelet32 (column nest) 

wavelet16 (column nest) 
v«avelet16 (row nest) 


"20TC 65 

1792 68 
0 36 
0 36 
0 20 
0 20 


//28 1100 
0 . o 

7728 hoc 
0 0 

7728 1100 
0 0 


212fc;$$> 
116^-116 
W*m 68t&&68 


102O>?$36 
804^96 
492^493 
388K^Ep 
228[?228 
180ft"38G 


«;;2$ 


3.8^|j 
3.3^ 

o,2 

2.61 2,6j 


ail configurations 

configurations 
waveiett>4 (column nest) 


oouo 

BS0 r i 

281U 


*&8StiS7F 
"DCach'e 
68 




3300 
"lCache 


^j?P5I 'H'it'ittkd 


31 12 r 6920 

Ref;,SjsteiTfj;f 




Cachef|^ 


waveletM (row nest) 


1U24 


68 






212(^21 
116f:^ 
£$ffij 116i^i2 
^rjBK 68^251 
68f.^5j| 


1U^U^383S 
804^28 
492^0^ 
388!^$j6 
228i : ; ! 4S| 
180; 


m 3,8^1 

g4 ; 25 4,2<-;^ 

?m 3,3i;^ 


wavelet3^ (column nest) 
wa\jelet32 (row nest) 


b12 
~5T2 


36 

36 






waveletf6 (column nest) 










wavelet16Trow nesQ 
an configurations 


~256 
b3/6 


24b 







The utilization of the _XppCfg_wavelet configurations shows that the XPP capacity is mostly used
for memory (wavelet64.nml, wavelet32.nml, wavelet16.nml). The information is taken from the 'info'
files generated from the NML source code by the XMAP tool.



Parameter                     Value
Vector length                 30 (14, 6) 32-bit values
Reused data set size
I/O IRAMs [sum - pct]         12 - 75%
ALU [sum - pct]               12 - 19%
BREG [def/route/sum - pct]    37/5/42 - 66%
FREG [def/route/sum - pct]    40/2/42 - 66%
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5.11 Conclusion 

The theoretical results did not scale well to real world results. The biggest single performance loss
was experienced during placement and routing. On the one hand this reflects the state of the
tools, but on the other hand it also shows current limitations of the architecture.

The following proposals may help to narrow the gap between theoretical and practical performance: 

5.11.1 RAM Bus Width 

A bus width of more than 32 bits is better suited to such a highly parallel architecture.

5.11.2 Use of the Cache Instead of Separate IRAMs

As the utilization of the shadow IRAMs is less than the utilization of the cache, a design
without dedicated IRAM memory is more silicon efficient.

5.11.3 Configuration Size 

^S fl8U ^i? 18 t0 *• ave rage configuration size. The same is true for the 

mstruction cache. The replicated structure of the array allows for a highly parallel reconfmmSinJ w 

betray 8 ^ ^ " ™ " ^ "° * ** ** 16 *S^^S2KS it 

5.11.4 ALU / FREG / BREG Orthogonality

The NOREG feature is limited to BREGs. Only one BREG in a sequence can be in unregistered mode.
This way it is possible to save cycles in a backend post optimization if the BREGs can be put into
unregistered mode. The number of saved cycles depends on the type and order of operations. This
feature is unorthogonal and makes it hard for the compiler to estimate the number of cycles.

SLT ^ S P e ? alizati0 " °f *» forward ™* backward units together with the delays on the busses 

osssz along * e line of 

If FREGs and ALUs are used in line, the computational flow either follows the column downward or
the line in the array. For the latter mode, NOREG BREGs must be used.

If only BREGs are needed sequentially, the computational flow follows the column in unward 
diction. As at least every second BREG in line must be in registered mode, UrfZ^STS 
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nrtln^ ST**^ * forWard REG ' a backwaid ALU and a backward REG this 

orthogonality would have positive effects on the freedom of placement and routing ^.-this 

5.11.5 Placement and Routing Improvements

If placement and routing of the critical path is done first, followed by the placement and routing of the
less critical components, fewer registers will be inserted into the critical path. In general,
several different heuristics should be used in placement and routing.

Critical path information passed between placement and routing and the compiler can help avoid the
added registers in the critical path.
NML currently does not cover specification of the bus switch elements There is no wav to r^i 

jsasassjr- ^ *-» ~ -« 
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A Method for Compiling High-Level Language Programs to a RecopSgurabJe Data-Row Processor 

1 Introduction 

This document describes a method for compiling a subset of a high-level programming language (HLL)
like C or FORTRAN, extended by port access functions, to a reconfigurable data-flow processor (RDFP)
as described in Section 3. The program is transformed to a configuration of the RDFP.

This method can be used as part of an extended compiler for a hybrid architecture consisting of a standard
host processor and a reconfigurable data-flow coprocessor. The extended compiler handles a full HLL
like standard ANSI C. It maps suitable program parts like inner loops to the coprocessor and the rest
of the program to the host processor. It is also possible to map separate program parts to separate
configurations. However, these extensions are not the subject of this document.

2 Compilation Flow 

This section briefly describes the phases of the compilation method. 

2.1 Frontend 

The compiler uses a standard frontend which translates the input program (e.g. a C program) into an
internal format consisting of an abstract syntax tree (AST) and symbol tables. The frontend also performs
well-known compiler optimizations such as constant propagation, dead code elimination, common
subexpression elimination etc. For details, refer to any compiler construction textbook like [1]. The SUIF
compiler [2] is an example of a compiler providing such a frontend.

2.2 Control/Dataflow Graph Generation 

Next, the program is mapped to a control/dataflow graph (CDFG) consisting of connected RDFP
functions. This phase is the main subject of this document and is presented in Section 4.

2.3 Configuration Code Generation 

The last phase directly translates the CDFG to configuration code used to program the RDFP. For
PACT XPP™ Cores, the configuration code is generated as an NML (Native Mapping Language) file.

3 Configurable Objects and Functionality of a RDFP 

This section describes the configurable objects and functionality of a RDFP. A possible implementation
of the RDFP architecture is a PACT XPP™ Core. Here we only describe the minimum requirements for
a RDFP for this compilation method to work. The only data types considered are multi-bit words called
data and single-bit control signals called events. Data and events are always processed as packets, cf.
Section 3.2. Event packets are called 1-events or 0-events, depending on their bit-value.

3.1 Configurable Objects and Functions 

An RDFP consists of an array of configurable objects and a communication network. Each object can
be configured to perform certain functions (listed below). It performs the same function repeatedly until
the configuration is changed. The array need not be completely uniform, i.e. not all objects need to be
able to perform all functions. E.g., a RAM function can be implemented by a specialized RAM object
which cannot perform any other functions. It is also possible to combine several objects to realize
more complex functions.
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Figure 1: Functions of an RDFP 

£ e a^ 

• ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU functions
(including comparators) are available for all operations used in the HLL. ALU functions have two
data inputs A and B, and one data output X. Comparators have an event output U instead of the
data output. They produce a 1-event if the comparison is true, and a 0-event otherwise.
• CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output events)
or causes the counter to terminate if UB is reached. If NEXT is not connected, the counter counts
continuously. The output events U, V, and W have the following functionality: For a counter
counting N times, N-1 0-events and one 1-event are generated at output U. At output V, N 0-events
are generated, and at output W, N 0-events and one 1-event are created. The 1-event at W is only
created after the counter has terminated, i.e. a NEXT event packet was received after the last data
packet was output.

• RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data input
RD and a data output OUT for reading at address RD. Event output ERD signals completion of
the read access. For a write access, data inputs WR and IN (address and value) and data output
OUT are used. Event output EWR signals completion of the write access. ERD and EWR always
generate 0-events. Note that external RAM can be handled as RAM functions exactly like internal
RAM.

• GATE: A GATE synchronizes a data packet at input A and an event packet at input E. When
both inputs have arrived, they are both consumed. The data packet is copied to output X and the
event packet to output U.

• MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a data output X. If
SEL receives a 0-event, input A is copied to output X and input B is discarded. For a 1-event, B is
copied and A is discarded.

• MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-event, input A is copied to output X, but input B is not discarded. The packet
is left at input B instead. For a 1-event, B is copied and A is left at its input.

• DEMUX: A DEMUX function has one data input A, an event input SEL, and two data outputs X
and Y. If SEL receives a 0-event, input A is copied to output X, and no packet is created at output
Y. For a 1-event, A is copied to Y, and no packet is created at output X.

• MDATA: An MDATA function replicates data packets. It has a data input A, an event input
SEL, and a data output X. If SEL receives a 1-event, a data packet at A is consumed and copied
to output X. For each subsequent 0-event at SEL, a copy of the input data packet is produced at the
output without consuming new packets at A. Only when another 1-event arrives at SEL is the next data
packet at A consumed and copied.²

• INPORT[name]: Receives data packets from outside the RDFP through input port "name" and 
copies them to data output X. If a packet was received, a 0-event is produced at event output U 
too. (Note that this function can only be configured at special objects connected to external busses.) 

• OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP through 
output port "name". If a packet was sent, a 0-event is produced at event output U, too. (Note that 
this function can only be configured at special objects connected to external busses.) 
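The difference between MUX, MERGE and DEMUX lies only in which packets are consumed. The following C sketch models each input and output as a one-packet buffer to make that difference concrete. The `Slot` type and the `*_fire` function names are illustrative assumptions, as is the choice to let MERGE fire as soon as SEL and the selected input are present; none of this is taken from an XPP implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* A one-place buffer holding at most one packet (hypothetical model). */
typedef struct { bool full; int value; } Slot;

Slot slot_empty(void) { Slot s = { false, 0 }; return s; }
Slot slot_of(int v)   { Slot s = { true, v };  return s; }

/* MUX: consumes A, B and SEL; copies A (0-event) or B (1-event) to X. */
bool mux_fire(Slot *a, Slot *b, Slot *sel, Slot *x)
{
    if (!a->full || !b->full || !sel->full || x->full) return false;
    assert(sel->value == 0 || sel->value == 1);
    *x = slot_of(sel->value ? b->value : a->value);
    *a = slot_empty();               /* the non-selected input is discarded too */
    *b = slot_empty();
    *sel = slot_empty();
    return true;
}

/* MERGE: copies the selected input; the other input keeps its packet. */
bool merge_fire(Slot *a, Slot *b, Slot *sel, Slot *x)
{
    if (!sel->full || x->full) return false;
    assert(sel->value == 0 || sel->value == 1);
    Slot *chosen = sel->value ? b : a;
    if (!chosen->full) return false; /* assumed: only the selected input is needed */
    *x = slot_of(chosen->value);
    *chosen = slot_empty();
    *sel = slot_empty();
    return true;
}

/* DEMUX: forwards A to X (0-event) or Y (1-event); the other output gets nothing. */
bool demux_fire(Slot *a, Slot *sel, Slot *x, Slot *y)
{
    if (!a->full || !sel->full) return false;
    assert(sel->value == 0 || sel->value == 1);
    Slot *out = sel->value ? y : x;
    if (out->full) return false;     /* stall until the target output is free */
    *out = slot_of(a->value);
    *a = slot_empty();
    *sel = slot_empty();
    return true;
}
```

After `mux_fire` both data inputs are empty, whereas after `merge_fire` the non-selected input still holds its packet, which is what allows MERGE to be used at loop entries.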

² Note that this can be implemented by a MERGE with special properties on XPP™.

Additionally, the following functions manipulate only event packets:

• 0-FILTER, 1-FILTER: A FILTER has an input E and an output U. A 0-FILTER copies a 0-event
from E to U, but 1-events at E are discarded. A 1-FILTER copies 1-events and discards 0-events.

• INVERTER: Copies all events from input E to output U but inverts their values.

• 0-CONSTANT, 1-CONSTANT: 0-CONSTANT copies all events from input E to output U, but
changes them all to value 0. 1-CONSTANT changes all to value 1.

• ECOMB: Combines two or more inputs E1, E2, E3..., producing a packet at output U. The output
is a 1-event if and only if one or more of the input packets are 1-events (logical OR). A packet must
be available at all inputs before an output packet is produced.³

• ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e.g. "0001", at its output U. If it
has an input START, one entire sequence is generated for each event packet arriving at START. The
sequence is only repeated if the next event arrives at START. However, if START is not connected,
ESEQ constantly repeats the sequence.

Note that ALU, MUX, DEMUX, GATE and ECOMB functions behave like their equivalents in classical 
dataflow machines [3, 4]. 
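The event outputs of the CNT function are easy to miscount, so the following C sketch tabulates them for a counter that counts n times. The function name `cnt_events` and the array-based interface are illustrative assumptions. The text only states how many events of each kind appear at U; placing the single 1-event at U last is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Fills u[0..n-1], v[0..n-1] and w[0..n] with the event values a CNT
   emits while counting n times. Returns the number of W events (n + 1). */
size_t cnt_events(size_t n, int *u, int *v, int *w)
{
    assert(n > 0 && u != NULL && v != NULL && w != NULL);
    for (size_t i = 0; i < n; i++) {
        u[i] = (i == n - 1) ? 1 : 0; /* assumed: the 1-event accompanies the last value */
        v[i] = 0;                    /* one 0-event per counter value */
        w[i] = 0;
    }
    w[n] = 1;                        /* created only after the counter has terminated */
    return n + 1;
}
```

For n = 3 the sketch yields U = 0, 0, 1; V = 0, 0, 0; and W = 0, 0, 0, 1, matching the counts given for CNT above.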

3.2 Packet-based Communication Network 

The communication network of an RDFP can connect an output of one object (i.e. its respective
function) to the input(s) of one or several other objects. This is usually achieved by busses and switches. By
placing the functions properly on the objects, many functions can be connected arbitrarily up to a limit
imposed by the device size. As mentioned above, all values are communicated as packets. A separate
communication network exists for data and event packets. The packets synchronize the functions as in a
dataflow machine with acknowledge [3]. I.e., the function only executes when all input packets are
available (apart from the non-strict exceptions as described above). The function also stalls if the last output
packet has not been consumed. Therefore a data-flow graph mapped to an RDFP self-synchronizes its
execution without the need for external control. Only if two or more function outputs (data or event) are
connected to the same function input ("N to 1 connection") is the self-synchronization disabled.⁴ In this
case, the user has to ensure that only one packet arrives at a time in a correct CDFG. Otherwise a packet might
get lost, and the value resulting from combining two or more packets is undefined. However, a function
output can be connected to many function inputs ("1 to N connection") without problems.
There are some special cases:

• A function input can be preloaded with a distinct value during configuration. This packet is
consumed like a normal packet coming from another object.

• A function input can be defined as constant. In this case, the packet at the input is reproduced
repeatedly for each function execution.
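The firing rule with acknowledge described in this section can be sketched in C as follows. The `Port` type and the function names are hypothetical; the sketch only illustrates that a strict function executes when all inputs are present and its previous output has been consumed, and that a constant input is never used up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One-packet buffer standing in for a network connection (assumed model). */
typedef struct { bool full; int value; } Port;

/* Strict firing rule for an ALU[ADD]: all inputs present, output free. */
bool add_fire(Port *a, Port *b, Port *x)
{
    assert(a != NULL && b != NULL && x != NULL);
    if (!a->full || !b->full) return false; /* wait for both operands */
    if (x->full) return false;              /* stall: last output not yet consumed */
    x->value = a->value + b->value;
    x->full = true;
    a->full = false;                        /* consuming acts as the acknowledge */
    b->full = false;
    return true;
}

/* Same ALU with input B defined as constant: the constant is reproduced
   for every execution and therefore never runs out. */
bool add_const_fire(Port *a, int constant_b, Port *x)
{
    assert(a != NULL && x != NULL);
    if (!a->full || x->full) return false;
    x->value = a->value + constant_b;
    x->full = true;
    a->full = false;
    return true;
}
```

Calling `add_fire` a second time without emptying `x` returns false, which is the stall that lets a mapped data-flow graph synchronize itself without external control.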

An RDFP requires register delays in the dataflow. Otherwise very long combinational delays and
asynchronous feedback would be possible. Delays are assumed to be inserted at the inputs of some
functions and in some routing segments of the communication network. Note that registers change
the timing, but not the functionality of a correct CDFG.

³ Note that this function is implemented by the EAND operator on the XPP™.

⁴ Note that on XPP™ Cores, an "N to 1 connection" for events is realized by the EOR function, and for data by just assigning
several outputs to an input.

4 Configuration Generation 
4.1 Language Definition 

The following HLL features are not supported by the method described here: 

• pointer operations 

• library calls, operating system calls (including standard I/O functions) 

• recursive function calls (Note that non-recursive function calls can be eliminated by function
inlining and therefore are not considered here.)

All scalar data types are converted to type integer. Integer values are equivalent to data packets in
the RDFP. Arrays (possibly multidimensional) are the only composite data types considered.

The following additional features are supported: 

4.2 Mapping of High-Level Language Constructs 

This method converts an HLL program to a CDFG consisting of the RDFP functions defined in Section 3.1.

The CDFG is generated by a traversal of the AST of the HLL program. It processes the program [...]
The following two pieces of information are updated during the traversal:

• START points to an event output of a function. This output delivers a 0-event whenever
the program execution reaches this program point. At the beginning, a 0-CONSTANT function is
added to the CDFG, and START initially points to its output. This event is used to start the overall
program execution. The START_new signal generated after a program part has finished executing is
used as the new START signal for the following program parts, or it signals termination of the entire
program. The START events guarantee that the execution order of the original program is maintained
wherever the data dependencies alone are not sufficient. This scheduling scheme is similar to a
one-hot controller for digital hardware.

• VARLIST is a list of {variable, function-output} pairs. The pairs map integer variables or array
elements to a CDFG function's output. The first pair for a variable in VARLIST contains the
output of the function which produces the value of this variable valid at the current program point.
New pairs are always added to the front of VARLIST. The expression VARDEF(var) refers to the
function-output of the first pair with variable var in VARLIST.⁶

The following subsections systematically list all HLL program components and describe how they are 
processed, thereby altering the CDFG, START and VARLIST. 
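VARLIST behaves like a singly linked list with insertion at the front and first-match lookup. The C sketch below shows one possible representation. The type names `Source` and `Pair`, the function names, and the encoding of a function output as a small integer id are assumptions made for illustration and are not taken from the compiler described here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A variable maps either to a constant pseudo-object or to the output
   of a CDFG function, identified here by an integer id (assumed encoding). */
typedef enum { SRC_CONST, SRC_OUTPUT } SrcKind;
typedef struct { SrcKind kind; int id_or_value; } Source;

typedef struct Pair {
    const char *var;
    Source src;
    struct Pair *next;
} Pair;

/* Adds {var, src} to the front of the list and returns the new head. */
Pair *varlist_add(Pair *head, const char *var, Source src)
{
    Pair *p = malloc(sizeof *p);
    assert(p != NULL);
    p->var = var;
    p->src = src;
    p->next = head;
    return p;
}

/* VARDEF(var): source of the first pair for var, or NULL if there is none. */
const Source *vardef(const Pair *head, const char *var)
{
    for (const Pair *p = head; p != NULL; p = p->next)
        if (strcmp(p->var, var) == 0)
            return &p->src;
    return NULL;
}

/* Releases all pairs of a list. */
void varlist_free(Pair *head)
{
    while (head != NULL) {
        Pair *next = head->next;
        free(head);
        head = next;
    }
}
```

Because lookup stops at the first match, redefining a variable only requires adding a new pair; older pairs stay in the list but are shadowed.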



4.2.1 Integer Expressions and Assignments 

Straight-line code without array accesses can be directly mapped to a data-flow graph. One ALU is
allocated for each operator in the program. Because of the self-synchronization of the ALUs, no explicit
control or scheduling is needed. Therefore processing these assignments does not access or alter START.
The data dependences (as they would be exposed in the DAG representation of the program [1]) are
analyzed through the processing of VARLIST. These assignments synchronize themselves through the
data-flow. The data-driven execution automatically exploits the available instruction level parallelism.
All assignments evaluate the right-hand side (RHS) or source expression. This evaluation results in a
pointer to a CDFG object's output (or pseudo-object as defined below). For integer assignments, the
left-hand side (LHS) variable or destination is combined with the RHS result object to form a new pair
{LHS, result(RHS)} which is added to the front of VARLIST.

The simplest statement is a constant assigned to an integer:⁷
    a = 5;

It does not change the CDFG, but adds {a, 5} to the front of VARLIST. The constant 5 is a
"pseudo-object" which only holds the value, but does not refer to a CDFG object. Now VARDEF(a) equals 5 at
subsequent program points before a is redefined.

Integer assignments can also combine variables already defined and constants:
    b = a * 2 + 3;

In the AST, the RHS is already converted to an expression tree. This tree is transformed to a combination
of old and new CDFG objects (which are added to the CDFG) as follows: Each operator (internal node)
of the tree is substituted by an ALU with the opcode corresponding to the operator in the tree. If a leaf
node is a constant, the ALU's input is directly connected to that constant. If a leaf node is an integer
variable var, it is looked up in VARLIST, i.e. VARDEF(var) is retrieved. Then VARDEF(var) (an output
of an already existing object in the CDFG or a constant) is connected to the ALU's input. The output of the
ALU corresponding to the root operator in the expression tree is defined as the result of the RHS. Finally,
a new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the
CDFG with two ALUs in Fig. 2 is created. Outputs occurring in VARLIST are labeled by Roman
numbers. After these two assignments, VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left
side.) Note that all inputs connected to a constant must be defined as constant inputs (cf. Section 3.2).

⁶ This method of using a VARLIST is adapted from the Transmogrifier C compiler [5].
⁷ Note that we use C syntax for the following examples.
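The expression-tree mapping described above can be sketched in C as a post-order traversal that allocates one ALU per operator. All type and function names below (`Operand`, `Alu`, `Cdfg`, `Expr`, `map_expr`) are hypothetical, the fixed capacity of 16 ALUs is arbitrary, and leaves are assumed to have been resolved through VARDEF before the traversal starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Either a constant value or the index of the ALU whose output is used. */
typedef struct { bool is_const; int v; } Operand;

typedef struct { char opcode; Operand a, b; } Alu;
typedef struct { Alu alus[16]; int count; } Cdfg;

typedef struct Expr {
    char op;                    /* 0 for a leaf, otherwise '+', '-' or '*' */
    Operand leaf;               /* leaf: a constant or an already resolved VARDEF(var) */
    const struct Expr *lhs, *rhs;
} Expr;

/* Allocates one ALU per operator and returns the operand that
   represents the result of the (sub)expression. */
Operand map_expr(Cdfg *g, const Expr *e)
{
    assert(g != NULL && e != NULL);
    if (e->op == 0)
        return e->leaf;         /* leaves add nothing to the CDFG */
    Operand a = map_expr(g, e->lhs);
    Operand b = map_expr(g, e->rhs);
    assert(g->count < 16);
    Alu alu = { e->op, a, b };
    g->alus[g->count] = alu;
    Operand out = { false, g->count };
    g->count++;
    return out;
}
```

For `b = a * 2 + 3` with VARDEF(a) = 5, the traversal creates a multiplier with two constant inputs and an adder fed by the multiplier's output and the constant 3, i.e. the two ALUs mentioned in the text. The returned operand is what would be paired with b in VARLIST.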



4.2.2 Conditional Integer Assignments 



For conditional if-then-else statements containing only integer assignments, objects for condition
evaluation are created first. The object event output indicating the condition result is kept for choosing
the correct branch result later. Next, both branches are processed in parallel, using separate copies
VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for all variables
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combination phase). The valid
definitions from VARLIST1 and VARLIST2 are combined with a MUX function, and the correct input
is selected by the condition result. For variables only defined in one of the two branches, the multiplexer
uses the result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a
functionally correct program this value will never be used. As an optimization, only variables live [1] after the
if-then-else structure need to be added to VARLIST in the combination phase.
Consider the following example:

    i = 7;
    a = 3;
    if (i < 10) {
        a = 5;
        c = 7;
    }
    else {
        c = a - 1;
        d = 0;
    }

Fig. 3 shows the resulting CDFG. Before the if-then-else construct, VARLIST = [{a, 3}, {i, 7}]. After
processing the branches, for the then branch, VARLIST1 = [{c, 7}, {a, 5}, {a, 3}, {i, 7}], [...]

[...] conditional [...] explicit [...] does not change START. [...] synchronized [...]
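The combination phase can be illustrated by executing the example at the value level: both branches are evaluated unconditionally and one MUX per variable selects the valid result. In the C sketch below, `run_example`, the `Result` type and the `UNDEFINED` stand-in are illustrative assumptions, and the index i is turned into a parameter (it is the constant 7 in the example) so that both outcomes can be observed.

```c
#include <assert.h>

enum { UNDEFINED = -1 };  /* stand-in for the special "undefined" constant */

/* MUX convention of Section 3.1: a 0-event selects A, a 1-event selects B. */
int mux(int sel_event, int a, int b)
{
    assert(sel_event == 0 || sel_event == 1);
    return sel_event ? b : a;
}

typedef struct { int a, c, d; } Result;

Result run_example(int i)
{
    int a = 3;
    int cond = (i < 10) ? 1 : 0;   /* comparator: 1-event if the comparison is true */

    /* Both branches are computed in parallel, independent of cond. */
    int a_then = 5, c_then = 7;    /* then branch */
    int c_else = a - 1, d_else = 0; /* else branch */

    Result r;
    r.a = mux(cond, a, a_then);          /* else side: value from the original VARLIST */
    r.c = mux(cond, c_else, c_then);     /* defined in both branches */
    r.d = mux(cond, d_else, UNDEFINED);  /* d has no definition on the then side */
    return r;
}
```

With i = 7 the condition is true, so a = 5 and c = 7, while d receives the undefined placeholder, which a functionally correct program never reads.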
4.2.3 General Conditional Statements

Conditional statements containing either array accesses (cf. Section 4.2.7 below) or inner loops cannot
be processed as described in Section 4.2.2. Data packets must only be sent to the active branch. This is
achieved by the implementation shown in Fig. 8, similar to the method presented in [4].

A dataflow analysis is performed to compute used sets use and defined sets def [1] of both branches.
For the current VARLIST entries of all variables in IN = use(thenbody) ∪ def(thenbody) ∪
use(elsebody) ∪ def(elsebody) ∪ use(header), DEMUX functions controlled by the IF condition are
inserted. Note that arrows with double lines in Fig. 8 denote connections for all variables in IN, and the
shaded DEMUX function stands for several DEMUX functions, one for each variable in IN. The
DEMUX functions forward data packets only to the selected branch. New lists VARLIST1 and VARLIST2
are compiled with the respective outputs of these DEMUX functions. The then branch is processed with
VARLIST1, and the else branch with VARLIST2. Finally, the output values are combined. OUT
contains the new values for the same variables as in IN. Since only one branch is ever activated, there will not
be a conflict due to two packets arriving simultaneously. The combinations will be added to VARLIST
after the conditional statement. If the IF execution shall be pipelined, MERGE opcodes for the output
must be inserted, too. They are controlled by the condition like the DEMUX functions.

The following extension with respect to [4] is added (dotted lines in Fig. 8) in order to control the
execution as mentioned above with START events: The START input is ECOMB-combined with the condition
output and connected to the SEL input of the DEMUX functions. The START inputs of thenbody and
elsebody are generated from the ECOMB output sent through a 1-FILTER and a 0-CONSTANT, or
through a 0-FILTER, respectively. The overall START_new output is generated by a simple "2 to 1
connection" of thenbody's and elsebody's START_new outputs. With this extension, arbitrarily nested
conditional statements or loops can be handled within thenbody and elsebody.
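The START event generation for a general conditional statement combines four of the event functions from Section 3.1. The C sketch below evaluates that small network for a single event. It encodes "no packet" as `NONE`, and all function and type names are assumptions chosen for illustration.

```c
#include <assert.h>

enum { NONE = -1 };  /* no packet present or produced (assumed encoding) */

int filter0(int e)   { return e == 0 ? 0 : NONE; }     /* 0-FILTER */
int filter1(int e)   { return e == 1 ? 1 : NONE; }     /* 1-FILTER */
int constant0(int e) { return e == NONE ? NONE : 0; }  /* 0-CONSTANT */

/* ECOMB: logical OR; needs a packet at every input before it produces one. */
int ecomb(int e1, int e2)
{
    if (e1 == NONE || e2 == NONE) return NONE;
    return (e1 | e2) ? 1 : 0;
}

typedef struct { int sel, start_then, start_else; } IfControl;

/* Control events of an if-then-else for one START event and one condition event. */
IfControl if_control(int start, int cond)
{
    assert(start == 0 || start == NONE);  /* START events are always 0-events */
    assert(cond == 0 || cond == 1 || cond == NONE);
    IfControl c;
    c.sel = ecomb(start, cond);           /* SEL input of the DEMUX functions */
    c.start_then = constant0(filter1(c.sel));
    c.start_else = filter0(c.sel);
    return c;
}
```

Since START is a 0-event, the ECOMB output carries the condition value but only appears once START has arrived. Exactly one of `start_then` and `start_else` is a 0-event, so only the selected branch is started.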

4.2.4 WHILE Loops 

WHILE loops are processed similarly to the scheme presented in [4], cf. Fig. 9. As in Section 4.2.3,
double line connections and shaded MERGE and DEMUX functions represent duplication for all variables
in IN. Here IN = use(whilebody) ∪ def(whilebody) ∪ use(header). The WHILE loop executes as
follows: In the first loop iteration, the MERGE functions select all input values from VARLIST at loop
entry (SEL=0). The MERGE outputs are connected to the header and the DEMUX functions. If the
while condition is true (SEL=1), the input values are forwarded to the whilebody, otherwise to OUT.
The output values of the while body are fed back to whilebody's input via the MERGE and DEMUX
functions [...]

Two extensions with respect to [4] are added (dotted lines in Fig. 9):

• In [4], the SEL input of the MERGE functions is preloaded with 0. Hence the loop execution
begins immediately and can be executed only once. Instead, we connect the START input to the
MERGE's SEL input ("2 to 1 connection" with the header output). This allows control of the time
of the start of the loop execution and allows the loop to be restarted.

• The whilebody's START input is connected to the header output, sent through a 1-FILTER/0-
CONSTANT combination as above (generates a 0-event for each loop iteration). The 0-CONSTANT
is required since START events must always be 0-events. By ECOMB-combining whilebody's
START_new output with the header output for the MERGE functions' SEL inputs, the next loop
iteration is only started after the previous one has finished. [...] The while
loop's START_new output is generated by filtering the header output for a 0-event.

With these extensions, arbitrarily nested conditional statements or loops can be handled within whilebody.



4.2.5 FOR Loops 



[...] The lower bound, upper bound and increment expressions are evaluated like other
expressions (see Sections 4.2.1 and 4.2.7) and connected to the respective inputs of the CNT function.

CNT's X output provides the current value of the loop index variable. If the final index value is required
(live) after the FOR loop, [...]

[...] Unless it is a constant value, the variable's input value from VARLIST [...] The SEL inputs are
preloaded with a 1-event to select the first [...]

The following control events (dotted lines in Fig. 10) are similar to the WHILE loop extensions, but
simpler. CNT's START input is connected to the FOR loop's overall START signal. START_new is generated
from CNT's W output, sent through a 1-FILTER and 0-CONSTANT. CNT's V output produces one 0-event
for each loop iteration and is therefore used as forbody's START. Finally, CNT's NEXT input is
connected to forbody's START_new output.

For pipelined loops (as defined below in Section 4.2.6), loop iterations are allowed to overlap. Therefore
CNT's NEXT input need not be connected. Now the counter produces index variable values and control
events as fast as they can be consumed. However, in this case CNT's W output is not sufficient as overall
START_new output since the counter [...] Section 4.3 [...] loop
example compilation with and without pipelining.

[...] a MERGE/DEMUX/ADD loop and a comparator for the counter functionality.

• Variables in IN2 (only used in forbody) are reproduced in the special MDATA [...]

4.2.6 Vectorization and Pipelining

[...] technique [6] can be easily applied to the compilation method presented here. [...]
loop-carried dependences, i.e. a value set in one iteration and used in a
subsequent iteration. Loops with array (i.e. RAM) accesses can be pipelined if the accesses do
not cause loop-carried dependences [...] In this case no RAM address
is written in one and read in a subsequent iteration. Therefore the read and write accesses to the same
RAM may overlap. This degree of freedom [...]

4.2.7 Array Accesses

[...] not have a single address space. Instead, the arrays are allocated to several RAMs. [...]

To reduce the complexity of the compilation process, array accesses are processed in two phases. Phase 1
uses "pseudo-functions" for RAM read and write accesses. A RAM read function has a RD data input
(read address) and an OUT data output (read value), and a RAM write function has WR and IN data
inputs (write address and write value). Both have a START event input and a U event output. The events
control the access order. In Phase 2, all accesses to the same RAM are combined and substituted by a
single RAM function. This involves manipulating the data and event inputs and outputs such that the
correct execution order is maintained and the outputs are forwarded to the correct part of the CDFG.

Phase 1: Since arrays are allocated to several RAMs, only accesses to the same RAM have to be
synchronized. Accesses to different RAMs can occur concurrently [...] In case of data
dependencies, the accesses self-synchronize automatically. Within pipelined loops, not even read and
write accesses to the same RAM have to be synchronized. [...]

[...] computes the address. If applicable, [...] write, WR input [...]

[...]
    x = a[i];
    y = a[j];
    z = x + y;

Fig. 13 shows the translation of the following write access:
    a[i] = x;
Phase 2: We now merge the pseudo-functions of all accesses to the same RAM and substitute them by
a single RAM function. For all data inputs (RD for read access and WR and IN for write access), GATEs
are inserted between the input and the RAM function. Their E inputs are connected to the respective
START inputs of the original pseudo-functions. If a RAM is read and written at only one program point,
the U output of the read and write access is moved to the ERD or EWR output, respectively. For example,
the single access a[i] = x; from Fig. 13 is transformed to [...]

However, if several read or several write accesses (i.e. from different program points)
to the same RAM occur, the ERD or EWR events are not specific anymore. But a START_new event of
the original pseudo-function should only be generated for the respective program point, i.e. for the
current access. This is achieved by connecting the START signals of all other accesses
of the same type (read or write) [...] of the current access. The resulting
signal produces an event for every access, but only for the current access a 1-event. This event is
ECOMB-combined with the RAM's ERD or EWR output. The ECOMB's output will only occur after
the access is completed. Because ECOMB OR-combines its event packets, only the current access
produces a 1-event. Next, this event is filtered with a 1-FILTER and changed by a 0-CONSTANT, [...]

[...] since only one access occurs at a time, the GATEs only allow one
data packet to arrive at the inputs.

[...] a selective gate which only forwards [...] 0-event. The signal created by the
ECOMB [...] used. The EWR output is processed like the ERD output [...]

This transformation ensures that all RAM accesses are executed correctly, but it is not very fast since read
or write accesses to the same RAM are not pipelined. The next access only starts after the previous one
is completed, even if the RAM being used has several pipeline stages. This inefficiency can be removed
as follows:

[...] (not [...] within a basic block [...]) [...]
possible to stream data into the RAM rather than waiting for the previous access to complete. For this
purpose, a combination of MERGE functions selects the RD or WR and IN inputs in the order given
by the sequence. The MERGEs must be controlled by iterative ESEQs [...]
a sequence. A combination of DEMUX functions with the same ESEQ control [...] It is most
efficient to arrange the MERGE and DEMUX functions as balanced binary trees.

The START_new signal is generated as follows: For a sequence of length n, the START signal of the
entire sequence is replicated n times by an ESEQ[00..1] function with the START input connected to
the sequence's START. Its output is directly [...]
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for sequences) [...]
connected to EWR or ERD, respectively, and sent through a 1-FILTER/0-CONSTANT combination,
similar to the basic method described above. Since only the last ESEQ output [...]
RAM access generates a START_new as required. [...]

[...] using the ESEQ method for
generating START_new, and Fig. 6 shows the final CDFG of the following accesses:

    x = a[i];
    y = a[j];
    z = a[k];

[...] event is used to control a [...] DEMUX which [...]



4.2.8 Input and Output Ports 



Input and output ports are processed similarly to array accesses. [...]

Output ports control the data packets by GATEs like array write accesses. The START_new signal is also
created as for RAM accesses.
4.3 More Examples

Fig. 7 shows the generated CDFG for the following for loop.

    a = b + c;
    for (i=0; i<=10; i++) {
        a = a + i;
        x[i] = k;
    }

In this example, IN1 = {a} and IN2 = {k} (cf. [...]). The MERGE function [...] is
replaced by a 2:1 data connection [...]
data packet arrives for variables b, c and k, and one final packet [...]
not use a START event since both operations (the adder and the RAM write) are [...]
by the counter anyway. But the RAM's EWR output [...] is connected to
CNT's NEXT input. Note that the pipelining optimization, cf. Section 4.2.6, [...]
is applied (which is possible for this loop), [...]
RAM's EWR output), as defined at the end of Section 4.2.5.

The following program contains a vectorizable (pipelined) loop with one write access to array (RAM) x and
a sequence of two read accesses to array (RAM) y. After the loop, another single read access to y occurs.

    z = 0;
    for (i=0; i<=10; i++) {
        x[i] = a;
        z = z + y[i] + y[2*i];
    }
    a = y[k];

Fig. 15 shows the intermediate CDFG generated before the array access Phase 2 transformation is
applied. The pipelined loop is controlled as follows: Within the loop, separate START signals for write
accesses to x and read accesses to y [...]

Fig. 16 shows the final CDFG after Phase 2. Note that [...]
needs no additional control, and [...]
Datapath and Compiler Integration of Coarse-grain
Reconfigurable XPP-Arrays into Pipelined RISC Processors

Abstract - Nowadays, the datapaths of modern
microprocessors reach their limits by using static
instruction sets. A way out of these limitations is a
dynamic reconfigurable processor datapath
extension achieved by integrating traditional static
datapaths with the coarse-grain dynamic
reconfigurable XPP architecture (eXtreme
Processing Platform). Therefore, a loose
asynchronous coupling mechanism of the
corresponding datapath units has been developed
and integrated onto a CMOS 0.13 µm standard cell
technology from UMC. Here the SPARC
compatible LEON processor is used, whose
datapath has been
extended to be configured and personalized for
specific applications. This allows a varied and
efficient use, e.g. in streaming application domains
like MPEG-4, digital filters, mobile communication
modulation, etc. The chosen coupling technique
allows asynchronous concurrency of the additionally
configured compound instructions, which are
integrated into the programming and compilation
environment of the LEON processor.



Introduction 

The limitations of conventional processors are
becoming more and more evident. The growing
importance of stream-based applications makes
coarse-grain dynamically reconfigurable
architectures an attractive alternative [3], [4], [6].
They combine the performance of ASICs, which
are very risky and expensive (development and
[...]), [...]

In spite of the possibilities we have today in VLSI
development, the basic concepts of microprocessor
architectures are the same as 20 years ago. The main
processing unit of modern conventional
microprocessors, the datapath, in its actual structure
follows the same style guidelines as its
predecessors. Although the development of
pipelined architectures or superscalar concepts [...]
increases the performance of a modern
microprocessor and allows higher frequency rates,
the main concept of a static datapath remains.
Therefore, each operation is a composition of basic
instructions that the used processor owns. The
benefit of the processor concept lies in the ability of
executing strongly control-dominant applications. Data
or stream oriented applications are not well suited
for this environment. The sequential instruction
execution is not the right target for that kind of
applications and needs high bandwidth because of
permanent retransmitting of instructions/data from
and to memory. This handicap is often eased by
the use of caches in various stages. A sequential
interconnection of filters, which do the
data manipulation without writing back the
intermediate results, would achieve the right optimization
and reduction of bandwidth. Practically, this kind of
chain of filters should be constructed in a logical
way and configured during runtime. Existing
approaches to extend instruction sets use static
modules, not modifiable during runtime.
Customized microprocessors or ASICs are
optimized for one special application environment.
It is nearly impossible to use the same
microprocessor core for another application without
losing the performance gain of this architecture.

A new approach of a flexible and high performance
datapath concept is needed, which allows to
reconfigure the functionality and make this core
mainly application independent without losing the
performance needed for stream-based applications.

This contribution introduces a new concept of
loosely coupled implementation of the dynamic
reconfigurable XPP architecture from PACT Corp.
into a static datapath of the SPARC compatible
LEON processor. Thus, this approach is different
from those where the XPP operates as a completely
separate (master) component within one
Configurable System-on-Chip (CSoC), together with
a processor core, global/local memory topologies
and efficient multi-layer AMBA bus interfaces [11].
Here, from the programmer's point of view, the
extended and adapted datapath seems like a dynamic
configurable instruction set. It can be customized for
a specific application and accelerate the execution
enormously. Therefore, the programmer has to
create a number of configurations, which can be
uploaded to the XPP-Array at run time, e.g. this
configuration can be used like a filter to calculate
stream-oriented data. It is also possible to configure
more than one function at the same time and use
them simultaneously. This concept promises an
enormous performance boost and the needed
flexibility and power reduction to perform a series
of applications very effectively.



1. LEON RISC Microprocessor 

For implementation of this concept we chose the 32-
bit SPARC V8 compatible microprocessor [1] [2],
LEON. This microprocessor is a synthesisable, freely
available VHDL model which has a load/store
architecture and a five-stage pipeline
implementation with separate instruction and data
caches.

Figure 1: LEON Architecture Overview

As shown in Figure 1, the LEON is provided with a
full implementation of the AMBA 2.0 AHB and APB
on-chip busses, a hardware multiplier and divider,
a programmable 8/16/32-bit memory controller for
external PROM, static RAM and SDRAM, and
several on-chip peripherals such as timers, UARTs,
an interrupt controller and a 16-bit I/O port. A simple
power down mode is implemented as well.
LEON is developed by the European Space Agency
(ESA) for future space missions. The performance
of LEON is close to that of the ARM9 series, but LEON
does not have a memory management unit (MMU)
implementation, which limits its use to single
memory space applications. In Figure 2 the
datapath of the LEON integer unit is shown.




Figure 2: LEON Pipelined Datapath Structure 

2. extreme Processing Platform - XPP 

The XPP architecture [6], [7], [8] is based on a 
hierarchical array of coarse-grain, adaptive 
computing elements called Processing Array 
Elements (PAEs) and a packet-oriented 
communication network. The strength of the XPP 
technology originates from the combination of array 
processing with unique, powerful run-time 
reconfiguration mechanisms. Since configuration
control is distributed over a Configuration Manager
(CM) embedded in the array, PAEs can be
configured rapidly in parallel while neighboring
PAEs are processing data. Entire applications can be
configured and run independently on different parts
of the array. Reconfiguration is triggered externally
or even by special event signals originating within
the array, enabling self-reconfiguring designs. By
utilizing protocols implemented in hardware, data
and event packets are used to process, generate,
decompose and merge streams of data.
The XPP has some similarities with other coarse-
grain reconfigurable architectures like the
KressArray [3] or Raw Machines [4], which are
specifically designed for stream-based applications.
XPP's main distinguishing features are its automatic 
packet-handling mechanisms and its sophisticated 
hierarchical configuration protocols for runtime- and 
self-reconfiguration. 
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2.1 Array Structure 

A CM consists of a state machine and internal RAM 
for configuration caching. The PAC itself (see top 
right-hand side of Figure 3) contains a configuration 
bus which connects the CM with PAEs and other 
configurable objects. Horizontal busses carry data 
and events. They can be segmented by configurable 
switch-objects, and connected to PAEs and special 
I/O objects at the periphery of the device. 
A PAE is a collection of PAE objects. The typical 
PAE shown in Figure 3 (bottom) contains a BREG 
object (back registers) and an FREG object (forward 
registers) which are used for vertical routing, as well
as an ALU object which performs the actual 
computations. The ALU performs common fixed- 
point arithmetical and logical operations as well as 
several special three-input opcodes like multiply-add,
sort, and counters. Events generated by ALU objects 
depend on ALU results or exceptions, very similar 
to the state flags of a classical microprocessor. A 
counter, e.g., generates a special event only after it 
has terminated. The next section explains how these 
events are used. Another PAE object implemented 
in the XPP is a memory object which can be used in 
FIFO mode or as RAM for lookup tables, 
intermediate results etc. However, any PAE object 
functionality can be included in the XPP 
architecture. 

2.2 Packet Handling and Synchronization 

PAE objects as defined above communicate via a
packet-oriented network. Two types of packets are
sent through the array: data packets and event
packets. Data packets have a uniform bit width
specific to the device type. In normal operation
mode, PAE objects are self-synchronizing. An
operation is performed as soon as all necessary data 
input packets are available. The results are 
forwarded as soon as they are available, provided 
the previous results have been consumed. Thus it is 
possible to map a signal-flow graph directly to ALU 
objects. Event packets are one bit wide. They 
transmit state information which controls ALU 
execution and packet generation. 
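
The firing rule described above can be illustrated with a small software model. This is only a sketch under assumptions: the class name PaeObject, the single output register and the method names are hypothetical and not part of the XPP tools; the model only mirrors the stated rule that an operation fires once all input packets are present and the previous result has been consumed.

```python
from collections import deque


class PaeObject:
    """Toy model of a self-synchronizing PAE object (hypothetical names)."""

    def __init__(self, op, n_inputs):
        self.op = op
        self.inputs = [deque() for _ in range(n_inputs)]
        self.output = None  # None means the previous result was consumed

    def push(self, port, packet):
        """Deliver one data packet to an input port."""
        self.inputs[port].append(packet)

    def step(self):
        """Fire once if every input holds a packet and the output is free."""
        if self.output is not None:
            return False  # previous result not yet consumed: stall
        if any(len(queue) == 0 for queue in self.inputs):
            return False  # at least one input packet is missing: wait
        args = [queue.popleft() for queue in self.inputs]
        self.output = self.op(*args)
        return True

    def take(self):
        """Consume the pending result, freeing the output."""
        result, self.output = self.output, None
        return result


adder = PaeObject(lambda a, b: a + b, n_inputs=2)
adder.push(0, 3)
fired_early = adder.step()  # second operand still missing
adder.push(1, 4)
fired = adder.step()        # both operands present
adder.push(0, 10)
adder.push(1, 20)
stalled = adder.step()      # first result has not been consumed yet
first = adder.take()
adder.step()
second = adder.take()
```

No packet is dropped in this model: the second pair of operands simply waits in the input queues until the first result has been taken.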

2.3 Configuration

Every PAE stores locally its current configuration
state, i.e. if it is part of a configuration or not (states
"configured" or "free"). Once a PAE is configured,
it changes its state to "configured". This prevents
the CM from reconfiguring a PAE which is still
used by another application. The CM caches the
configuration data in its internal RAM until the
required PAEs become available.




Figure 3: Structure of an XPP device 



While loading a configuration, all PAEs start to 
compute their part of the application as soon as they 
are in state "configured". Partially configured
applications are able to process data without loss of 
packets. This concurrency of configuration and 
computation hides configuration latency. 

2.4 XPP Application Mapping 

The Native Mapping Language (NML), a PACT
proprietary structural language with reconfiguration
primitives, was developed by PACT to map
applications to the XPP array. It gives the
programmer direct access to all hardware features.
In NML, configurations consist of modules which
are specified as in a structural hardware description
language, similar to, for instance, structural VHDL.
PAE objects are explicitly allocated, optionally 
placed, and their connections specified. Hierarchical 
modules allow component reuse, especially for 
repetitive layouts. Additionally, NML includes 
statements to support configuration handling. A 
complete NML application program consists of one 
or more modules, a sequence of initially configured 
modules, differential changes, and statements which 
map event signals to configuration and prefetch 
requests. Thus configuration handling is an explicit 
part of the application program. 
A complete XPP Development Suite (XDS) is 
available from PACT. For more details on XPP- 
based architectures and development tools see [6]. 






3. LEON Instruction Datapath Extension 

The system is designed to offer a maximum of
performance. LEON and XPP should be able to
communicate with each other in a simple and
high-performance manner. While the XPP is a
dataflow-oriented device, the LEON is a
general-purpose processor, suitable for handling
control flow [1], [2]. Therefore, LEON is used for
system control. To do this, the XPP is integrated
into the datapath of the LEON integer unit, which is
able to control the XPP.



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Figure 4: Extended Datapath Overview 

Due to the unpredictable operation time of the XPP
algorithm, the XPP is integrated into the LEON
datapath in a loosely coupled way (Figure 4). Thus
the XPP array can operate independently of the
LEON, which is able to control and reconfigure the
XPP during runtime. Since the configuration of the
XPP is handled by LEON, the CM of the XPP is
unnecessary and can be left out of the XPP array.
The configuration codes are stored in the LEON
RAM. LEON transfers the needed configuration
from its system RAM into the XPP and creates the
needed algorithm on the array.
To enable a maximum of independence of XPP
from LEON, all ports of the XPP - input ports as
well as output ports - are buffered using dual-clock
FIFOs, which are implemented in the IO-Ports
between LEON and XPP. To transmit data to the
extended XPP-based datapath, the data are passed
through an IO-Port as shown in Figure 5. In addition
to the FIFO, the IO-Ports contain logic to generate
handshake signals and an interrupt request signal.
The IO-Port for receiving data from XPP is similar
to Figure 5 except for the reversed direction of the
data signals. This allows the XPP to work
completely independently of LEON as long as there
are input data available in the input port FIFOs and
free space for result data in the output port FIFOs.
There are a number of additional features
implemented in the LEON pipeline to control the
data transfer between LEON and XPP.
Figure 5: LEON-to-XPP dual-clock FIFO

When LEON tries to write to an IO-Port containing
a full FIFO or read from an IO-Port containing an
empty FIFO, a trap is generated. This trap can be
handled through a trap handler. A further
mechanism, pipeline-holding, is implemented to
allow LEON to hold the pipeline and wait for free
FIFO space during an XPP write access or for a
valid FIFO value during an XPP read access.
When using pipeline-holding, the software developer
has to avoid reading from an IO-Port with an empty
FIFO while the XPP, or rather the XPP input
IO-Ports, contain no data to produce outputs. In this
case a deadlock will occur and the complete system
has to be reset.

XPP can generate interrupts for the LEON when
trying to read a value from an empty FIFO port or to
write a value to a full FIFO port. The occurrence of
interrupts indicates that the XPP array cannot
process the next step because it has either no input
values or it cannot output the result value. The
interrupts generated by the XPP are maskable.
The interface provides information about the FIFOs.
LEON can read the number of valid values the FIFO
contains.
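
The LEON-side behaviour of an IO-Port in trap mode can be sketched in software as follows. This is a simplified model under assumptions, not the hardware implementation: the names IoPort, FifoTrap and safe_write are hypothetical, the FIFO depth is arbitrary, and the two clock domains are not modelled. It only shows that an access to a full or empty FIFO traps, and that reading the status count first avoids the trap.

```python
from collections import deque


class FifoTrap(Exception):
    """Stands in for the trap generated on an illegal FIFO access."""


class IoPort:
    """Simplified IO-Port: a bounded FIFO plus a status count (assumed model)."""

    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()

    def status(self):
        """Number of valid values currently held in the FIFO."""
        return len(self.fifo)

    def write(self, value):
        if len(self.fifo) >= self.depth:
            raise FifoTrap("write to full FIFO")
        self.fifo.append(value)

    def read(self):
        if not self.fifo:
            raise FifoTrap("read from empty FIFO")
        return self.fifo.popleft()


def safe_write(port, value):
    """Check the status count first so that no trap is raised."""
    if port.status() < port.depth:
        port.write(value)
        return True
    return False


port = IoPort(depth=2)
accepted = [safe_write(port, v) for v in (1, 2, 3)]  # third write is refused
try:
    port.write(99)  # unchecked write to the full FIFO
    trapped = False
except FifoTrap:
    trapped = True
drained = [port.read(), port.read()]
```

In pipeline-hold mode the same two conditions would stall the access instead of raising a trap, which is why a read from a FIFO that can never be filled leads to the deadlock mentioned above.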
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The interface to the XPP appears to the LEON as a 
set of special registers (Figure 6). These XPP
registers can be categorized into communication
registers and status registers.



Figure 6: Extended LEON Instruction Pipeline

For data exchange the XPP communication registers
are used. Since XPP provides three different types
of communication ports, there are also three types of
communication registers, each of which is
split into an input part and an output part:
The data for the process are accessed through XPP
data registers. The number of data input and data
output ports as well as the data bit width depend on
the implemented XPP array.
XPP can generate and consume events. Events are
one-bit signals. The number of input events and
output events again depends on the implemented
XPP array.

Configuration of the XPP is done through the XPP
configuration register. LEON reads the required
configuration value from a file stored in its
system RAM and writes it to the XPP
configuration register.

There are a number of XPP status registers
implemented to control the behavior and get status
information of the interface. Switching between the
usage of trap handling and pipeline holding can be
done in the hold register. An XPP clock register
contains the clock frequency ratio between LEON and
XPP. By writing this register, LEON software can
set the XPP clock relative to the LEON clock. This
allows the XPP clock frequency to be adapted to the
required XPP performance and consequently to
influence the power consumption of the system.
Writing zero to the clock register turns off the XPP.
Finally, there is a status register for every FIFO
containing the number of valid values actually
available in the FIFO.
These status registers provide a maximum of
flexibility in communication between LEON and
XPP and enable different communication modes:
- If there is only one application running on the
  system at a time, software may be developed
  in pipeline-hold mode. Here LEON initiates a
  data read from or a data write to the XPP. If
  there is no value to read or no free space to
  write to, the LEON pipeline will be stopped until
  the read or write is possible. This can be used to
  reduce power consumption of the LEON part.
- In interrupt mode, XPP can influence the LEON
  program flow. The IO-Ports generate an
  interrupt depending on the actual number of
  values available in the FIFO. The
  communication between LEON and XPP is
  done in interrupt service routines.






- Polling mode is the classical way to access the
  XPP. Initiated by a timer event, LEON reads all
  XPP ports containing data and writes all XPP
  ports containing free FIFO space. Between
  these phases LEON can perform other
  calculations.

It is possible at any time to switch between these
strategies within one application.
The XPP is delivered containing a configuration
manager to handle configuration and reconfiguration
of the array. In this concept the configuration
manager is dispensable because the configuration as
well as the reconfiguration is controlled by the
LEON through the XPP configuration register. All
XPP configurations used for an application are
stored in the LEON's system RAM.



4. Tool and Compiler Integration

The LEON's SPARC V8 instruction set [1] was
extended by a new subset of instructions to make the
new XPP registers accessible through software.
These instructions are based on the SPARC
instruction format but do not conform to the
SPARC V8 standard. Corresponding to the SPARC
conventions of a load/store architecture, the
instruction subset can be split into two general
types. Load/store instructions can exchange
data between the LEON memory and the XPP
communication registers. The number of cycles per
instruction is similar to that of the standard
load/store instructions of the LEON.
Read/write instructions are used for communications
between LEON registers. Since the LEON register
set is extended by the XPP registers, the read/write
instructions are also extended to access XPP
registers. Status registers can only be accessed with
read/write instructions. Execution of arithmetic
instructions on XPP registers is not possible. Values
have to be written to standard LEON registers
before they can be the target of arithmetic operations.
The complete system can still operate any SPARC
V8 compatible code. In that case, the XPP is
completely unused.

The LEON is provided with the LECCS cross
compiler system [9] standing under the terms of
the [...]. This system consists of modified versions of
the binutils 2.11 and gcc 2.95.2. To make the new
instruction subset available to software developers,
the assembler of the binutils has been extended by a
number of instructions according to the
implemented instruction subset. The new
instructions have the same mnemonic as the regular
SPARC V8 load, store, read and write instructions.
Only the new XPP registers have to be used as
source or target operand. Since the
modifications of LECCS are straightforward
extensions, the cross compiler system is backward
compatible with the original version. The availability
of the source code of LECCS made it possible to extend
the tools by the new XPP operations in the described
way.
The development of the XPP algorithms has to be
done with separate tools, provided by PACT Corp.



5. Application Results 



As a first analysis application, an inverse DCT applied
to an 8x8 pixel block was implemented. For all
simulations we used a 250 MHz clock frequency for
the LEON processor and a 50 MHz clock frequency for
the XPP. As Table 1 shows, the XPP accelerates the
iDCT computation.

|                      | LEON alone   | LEON with XPP | LEON with XPP | LEON with XPP |
|                      |              | in IRQ Mode   | in Poll Mode  | in Hold Mode  |
| Configuration of XPP | -            | 71.308 µs     | 84.364 µs     | 77.976 µs     |
|                      |              | 17,827 cycles | 21,091 cycles | 19,494 cycles |
| 2D iDCT (8x8)        | 14.672 µs    | 3.272 µs      | 3.872 µs      | 3.568 µs      |
|                      | 3,668 cycles | 818 cycles    | 968 cycles    | 892 cycles    |

Table 1 Performance on IDCT (8x8)
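
The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below is an illustration, not part of the original evaluation: the function names are hypothetical, and the break-even calculation assumes that the configuration is loaded once and that every 8x8 block then costs the per-block cycle count listed in the table.

```python
import math

LEON_CLOCK_HZ = 250e6  # LEON clock frequency used in the simulations

IDCT_CYCLES_LEON = 3668  # 2D iDCT (8x8) on LEON alone, from Table 1
XPP_MODES = {            # mode: (configuration cycles, iDCT cycles), from Table 1
    "irq": (17827, 818),
    "poll": (21091, 968),
    "hold": (19494, 892),
}


def cycles_to_us(cycles):
    """Convert LEON clock cycles into microseconds."""
    return cycles / LEON_CLOCK_HZ * 1e6


def speedup(mode):
    """Per-block speedup of the XPP over LEON alone, ignoring configuration."""
    return IDCT_CYCLES_LEON / XPP_MODES[mode][1]


def break_even_blocks(mode):
    """Smallest n with config + n * xpp_cycles < n * leon_cycles."""
    config, per_block = XPP_MODES[mode]
    saved_per_block = IDCT_CYCLES_LEON - per_block
    return math.floor(config / saved_per_block) + 1


for mode in XPP_MODES:
    print(mode, round(speedup(mode), 2), break_even_blocks(mode))
```

Under these assumptions the configuration overhead is amortized after seven blocks in IRQ mode and after eight blocks in poll and hold mode, which matches the observation that the benefit grows with the number of blocks computed per configuration.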
Figure 7 Computation Time of IDCT (8x8)

The speedup is about a factor of four, depending on
the communication mode. However, the XPP has to be
configured before computing the IDCT on it. Table 1
also shows the configuration time for this algorithm.
As shown in Figure 7, the benefit brought by XPP
rises with the number of IDCT blocks computed by it
before reconfiguration, so the number of
reconfigurations during complex algorithms should be
minimised.
A first complex application implemented on the
system is MPEG-4 decoding. The optimization of
the algorithm partitioning on LEON and XPP is still
in progress. Figure 8 shows the block diagram of
the MPEG-4 decoding algorithm. Frames
with 320 x 240 pixels were decoded. Using
SPARC V8 standard instructions, LEON decodes one
frame in 23.46 seconds. In a first implementation of
MPEG-4 using the XPP, only the IDCT is computed
by the XPP; the rest of the MPEG-4 decoding is still
done by LEON. With the help of the XPP, one
frame is decoded in 17.98 s. This is a performance
boost of more than twenty percent. Since the
performance gain from accelerating only the iDCT
algorithm is still very low at the moment, we are
working on XPP implementations of Huffman decoding,
dequantisation and prediction decoding, so that the
performance boost of this concept against the
standalone LEON will be increased.

6. Conclusion



Today, the instruction datapaths of modern
microprocessors reach their limits by using static
instruction sets, driven by the traditional von
Neumann or Harvard architectural principles. A way
out of these limitations is a dynamically
reconfigurable processor datapath extension achieved
by integrating traditional static datapaths with the
coarse-grain dynamically reconfigurable XPP
architecture (eXtreme Processing Platform).
Therefore, a loosely asynchronous coupling
mechanism of the given instruction datapath has
been developed and integrated onto a CMOS 0.13
µm standard cell technology from UMC. Here, the
SPARC-compatible LEON RISC processor is used,
whose static pipelined instruction datapath has
been extended to be configured and personalized for
specific applications. This compiler-compatible
instruction set extension allows varied and
efficient use, e.g. in streaming application domains
like MPEG-4, digital filters, mobile communication
modulation, etc. The introduced coupling technique
by flexible dual-clock FIFO interfaces allows
asynchronous concurrency and adapting the
frequency of the configured XPP datapath
to actual performance requirements, e.g.
for avoiding unneeded cycles and reducing power
consumption.

As presented above, the introduced concept
combines the flexibility of a general-purpose
microprocessor with the performance and power
consumption of coarse-grain reconfigurable
datapath structures, nearly comparable to ASIC
performance. Here, two programming and
computing paradigms (control-driven von Neumann
and transport-triggered XPP) are unified within one
hybrid architecture with the option of two clock
domains. The ability to reconfigure the
transport-triggered XPP makes the system independent
of standards or specific applications. This concept
opens the potential to develop multi-standard
communication devices like software radios by
using one extended processor architecture with
adapted programming and compilation tools. Thus
new standards can be easily implemented through
software updates. The system is scalable at
design time through the scalable array structure of
the used XPP extension. This extends the range of
suitable applications from products with few
multimedia functions to complex high-performance
systems.
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In coupling the XPP or any other data processing array having a number of 
preferably coarse grain cells to a conventional (that is, sequential and/or von
Neumann) processor/microcontroller design, a number of op code instructions may
be added to the instructions set of the conventional processor. A non-limiting 
example is given below and it will be obvious to the average skilled person that it is 
not intended to limit the invention but disclose certain aspects thereof in more detail, 
the aspects being of more or less importance. For example, it may be the case that 
other bit lengths than indicated for instructions are used. It is also to be understood 
that the mnemonics might be changed and that in certain cases additional 
instructions and/or operations might be useful whereas in other cases or for other 
cases a subset of the instructions indicated below might be useful as well. For 
example, it is easily possible to combine one or more XPP or any other reconfigurable
device or set or group of identical or different devices, in particular runtime
reconfigurable and/or coarse grain devices, FPGA and/or data streaming processors
with any design other than the LEON processor and/or a processor using SPARC 
instructions. Also, the use of the instruction set is not limited to certain compiling 
algorithms although the compiling techniques disclosed in other parts of the present 
invention are very useful. It is to be noted that one preferred way of using the XPP or 
other reconfigurable device or set or group of identical or different devices, in 
particular runtime reconfigurable and/or coarse grain devices, FPGA and/or data
streaming processors coupled to a design such as the LEON processor and/or other
conventional processor is the use of macro libraries so that predefined configurations
can be instantiated and/or called as subroutines. These libraries may be
automatically compiled and/or the configurations corresponding thereto may be set 
up by hand. This being noted, with respect to additional op-code instructions 
the following is noted: 



All additional instructions refer to format 3 of the SPARC instruction set, the op index 
being 3. The SPARC specification uses this format for the declaration of memory 
accesses. As in the original instruction set a plurality of op-codes had not been 
implemented, there was an opportunity to use the free fields for dedicated purposes. 
Also, it was possible to ensure completeness of instructions; for example, no memory 
access instruction is located in between arithmetic instructions.
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Overview over the SPARC instruction format 3 



| op | rd | op3 | rs1 | i=0 | asi    | rs2 |
| op | rd | op3 | rs1 | i=1 | simm13       |
| op | rd | op3 | rs1 | opf          | rs2 |
  31   29   24    18    13    12       4  0



Here, the abbreviations have the following meaning:

• rd: This field is five bits long. It contains the address of the source or target
  register for arithmetic and for Load-/Store-operations.

• op3: This field is six bits long. Together with the op field it forms the instruction
  opcode.

• rs1: This field is five bits long. It addresses the register holding the first
  operand of an ALU-operation.

• opf: This field is nine bits long and contains the opcode of a floating point
  operation.

• i: This is a one-bit field selecting the second operand for arithmetic or Load-/
  Store-operations respectively. In case i=1, the operand is the content of
  simm13, otherwise the operand is the content of rs2.

• asi: This field is eight bits long and indicates the address space which is
  accessed by Load-/Store-operations.

• simm13: This field is thirteen bits long and contains the second operand of an
  arithmetic and/or Load-/Store-operation, the operand having a sign (+, -).

• rs2: This field is five bits long and corresponds to the operand of an arithmetic
  and/or Load-/Store-operation respectively. It does not have a sign (+, -).
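
The field layout of format 3 can be made concrete with a short packing and unpacking routine. This is an illustrative sketch: the function names are hypothetical, and the bit positions follow the standard SPARC V8 format 3 layout (op in bits 31:30, rd in 29:25, op3 in 24:19, rs1 in 18:14, i in bit 13, then either asi in 12:5 and rs2 in 4:0, or simm13 in 12:0).

```python
def sign_extend_13(value):
    """Interpret the low 13 bits of value as a signed two's complement number."""
    value &= 0x1FFF
    return value - 0x2000 if value & 0x1000 else value


def encode_format3(op, rd, op3, rs1, i, rs2=0, asi=0, simm13=0):
    """Pack the format 3 fields into one 32-bit instruction word."""
    word = ((op & 0x3) << 30) | ((rd & 0x1F) << 25)
    word |= ((op3 & 0x3F) << 19) | ((rs1 & 0x1F) << 14)
    if i:
        word |= (1 << 13) | (simm13 & 0x1FFF)
    else:
        word |= ((asi & 0xFF) << 5) | (rs2 & 0x1F)
    return word


def decode_format3(word):
    """Split a 32-bit instruction word into its format 3 fields."""
    fields = {
        "op": (word >> 30) & 0x3,
        "rd": (word >> 25) & 0x1F,
        "op3": (word >> 19) & 0x3F,
        "rs1": (word >> 14) & 0x1F,
        "i": (word >> 13) & 0x1,
    }
    if fields["i"]:
        fields["simm13"] = sign_extend_13(word)
    else:
        fields["asi"] = (word >> 5) & 0xFF
        fields["rs2"] = word & 0x1F
    return fields


# Standard SPARC V8 load "ld [%g1 + 4], %g2": op=3, rd=2, op3=0, rs1=1, i=1
ld_word = encode_format3(op=3, rd=2, op3=0, rs1=1, i=1, simm13=4)
negative = decode_format3(encode_format3(op=3, rd=5, op3=0, rs1=6, i=1, simm13=-8))
```

The sign extension in decode_format3 reflects the statement that simm13 carries a sign while rs2 does not.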

Overview of additional instructions

| Opcode      | Meaning                                                                             | Privileged |
| stxppd      | Write word from an XPP data register into memory                                    | no         |
| ldxppd      | Load word from memory into an XPP data register                                     | no         |
| stxppe      | Write word from an XPP event register into memory                                   | no         |
| ldxppe      | Load word from memory into an XPP event register                                    | no         |
| ldcm        | Load word from memory into a CM register                                            | yes        |
| stcm        | Write word from a CM register into memory                                           | yes        |
| cptoxppd    | Copy a word from a LEON register into an XPP data register                          | no         |
| cptoleond   | Copy a word from an XPP data register into a LEON register                          | no         |
| cptoxppe    | Copy a word from a LEON register into an XPP event register                         | no         |
| cptoleone   | Copy a word from an XPP event register into a LEON register                         | no         |
| cptocm      | Copy a word from a LEON register into a CM register                                 | yes        |
| cptoleoncm  | Copy a word from a CM register into a LEON register                                 | yes        |
| cptoleonsdi | Copy a word from the status register of an XPP data input register into a LEON register   | no   |
| cptoleonsdo | Copy a word from the status register of an XPP data output register into a LEON register  | no   |
| cptoleonsei | Copy a word from the status register of an XPP event input register into a LEON register  | no   |
| cptoleonseo | Copy a word from the status register of an XPP event output register into a LEON register | no   |
| wrclkr      | Write into the clock register to determine the clock ratio LEON-XPP                 | yes        |
| wroffsetr   | Write into the memory offset register for memory mapped mode                        | yes        |
| rdclkr      | Read the clock register for the clock ratio LEON-XPP                                | yes        |
| rdoffsetr   | Read the memory offset register for memory mapped mode                              | yes        |
| rdtrapr     | Read the register with information about an XPP trap                                | yes        |


Data transfer between LEON and XPP

| Opcode    | op3    | Operation                                                   |
| cptoxppd  | 101110 | Copy a word from a LEON register into an XPP data register  |
| cptoleond | 101111 | Copy a word from an XPP data register into a LEON register  |
| cptoxppe  | 110010 | Copy a word from a LEON register into an XPP event register |
| cptoleone | 110011 | Copy a word from an XPP event register into a LEON register |

Format (3):

| 11 | rd | op3 | rs1 | rxpp(opf) | rs2 |
  31   29   24    18    13          4  0

Assembler Syntax:

cptoxppd   reg_rd, reg_rxpp
cptoleond  reg_rxpp, reg_rd
cptoxppe   reg_rd, reg_rxpp
cptoleone  reg_rxpp, reg_rd

Description:

CPTOXPPD loads a word from r[rd] to the data register r[rxpp] of XPP architecture.
CPTOLEOND loads a word from a data register r[rxpp] of XPP architecture to r[rd].
CPTOXPPE loads a word from r[rd] to event register r[rxpp] of XPP architecture.
CPTOLEONE loads a word from event register r[rxpp] of XPP architecture to r[rd].

Traps:

xpp_readaccess_error
xpp_writeaccess_error
xpp_regnotexist_error
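
As an illustration of how an assembler might emit these register copy instructions, the following sketch packs a mnemonic and its two register numbers into an instruction word. It rests on assumptions that the text does not spell out: the XPP register number is taken to occupy the whole nine-bit opf field (bits 13:5), and the rs1 and rs2 fields, which the assembler syntax does not use, are set to zero. The function name encode_copy is hypothetical.

```python
# op3 codes of the LEON <-> XPP register copy instructions listed above
OP3 = {
    "cptoxppd": 0b101110,
    "cptoleond": 0b101111,
    "cptoxppe": 0b110010,
    "cptoleone": 0b110011,
}


def encode_copy(mnemonic, rd, rxpp):
    """Build the instruction word for one register copy instruction.

    Assumption: rxpp is placed in the opf field (bits 13:5); rs1 and rs2 stay 0.
    """
    if not 0 <= rd < 32:
        raise ValueError("rd must fit into five bits")
    if not 0 <= rxpp < 512:
        raise ValueError("rxpp must fit into the nine-bit opf field")
    return (0b11 << 30) | (rd << 25) | (OP3[mnemonic] << 19) | (rxpp << 5)


word = encode_copy("cptoxppd", rd=1, rxpp=2)
print(hex(word))
```

Because the op field is 11 and the op3 codes were unused in the original instruction set, such words cannot collide with the regular SPARC V8 load and store instructions.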



Data transfer between LEON and CM

| Opcode     | op3    | Operation                                           |
| cptocm     | 110110 | Copy a word from a LEON register into a CM register |
| cptoleoncm | 110111 | Copy a word from a CM register into a LEON register |

Format (3):

| 11 | rd | op3 | rs1 | rcm(opf) | rs2 |
  31   29   24    18    13         4  0

Assembler Syntax:

cptocm      reg_rd, reg_rcm
cptoleoncm  reg_rcm, reg_rd

Description:

CPTOCM loads a word of r[rd] into a register r[rcm] of CM.
CPTOLEONCM loads a word from register r[rcm] of CM to r[rd].

Traps:

privileged_instruction
cm_writeaccess_error
cm_regnotexist_error



Data transfer between XPP and memory

| Opcode | op3    | Operation                                         |
| stxppd | 100010 | Store word from an XPP data register into memory  |
| ldxppd | 100011 | Load word from memory into an XPP data register   |
| stxppe | 100110 | Store word from an XPP event register into memory |
| ldxppe | 100111 | Load word from memory into an XPP event register  |

Format (3):

| op | rxpp(rd) | op3 | rs1 | i=0 | asi    | rs2 |
| op | rxpp(rd) | op3 | rs1 | i=1 | simm13       |
  31   29         24    18    13    12       4  0

Assembler Syntax:

stxppd  reg_rxpp, [address]
ldxppd  [address], reg_rxpp
stxppe  reg_rxpp, [address]
ldxppe  [address], reg_rxpp

Description:

STXPPD / STXPPE writes a word from register rxpp into memory.
LDXPPD / LDXPPE loads a word from memory into register rxpp.
The address is calculated as "r[rs1] + r[rs2]" in case that i = 0, otherwise as
"r[rs1] + simm13".

Traps:

xpp_readaccess_error
xpp_writeaccess_error
xpp_regnotexist_error
mem_address_not_aligned
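
The address calculation for these load and store instructions can be written out as a small function. This is a sketch under assumptions: simm13 is sign-extended as in standard SPARC V8, word accesses are assumed to require four-byte alignment (the usual condition for mem_address_not_aligned), and the names effective_address and MemAddressNotAligned are hypothetical.

```python
class MemAddressNotAligned(Exception):
    """Stands in for the mem_address_not_aligned trap."""


def sign_extend_13(value):
    """Interpret the low 13 bits of value as a signed two's complement number."""
    value &= 0x1FFF
    return value - 0x2000 if value & 0x1000 else value


def effective_address(regs, rs1, i, rs2=0, simm13=0):
    """Return r[rs1] + r[rs2] for i = 0, otherwise r[rs1] + simm13."""
    offset = sign_extend_13(simm13) if i else regs[rs2]
    address = (regs[rs1] + offset) & 0xFFFFFFFF  # wrap to 32 bits
    if address % 4 != 0:
        raise MemAddressNotAligned(hex(address))
    return address


regs = [0] * 32
regs[1] = 0x1000
regs[2] = 0x20
addr_reg = effective_address(regs, rs1=1, i=0, rs2=2)          # register + register
addr_imm = effective_address(regs, rs1=1, i=1, simm13=0x1FFC)  # 0x1FFC encodes -4
try:
    effective_address(regs, rs1=1, i=1, simm13=2)
    misaligned = False
except MemAddressNotAligned:
    misaligned = True
```

The same calculation applies to the CM load and store instructions, which use the identical format with rcm in place of rxpp.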



Data transfer between CM and memory

| Opcode | op3    | Operation                                 |
| ldcm   | 101010 | Load word from memory into a CM register  |
| stcm   | 101011 | Write word from a CM register into memory |

Format (3):

| op | rcm(rd) | op3 | rs1 | i=0 | asi    | rs2 |
| op | rcm(rd) | op3 | rs1 | i=1 | simm13       |
  31   29        24    18    13    12       4  0

Assembler Syntax:

ldcm  [address], reg_rcm
stcm  reg_rcm, [address]

Description:

STCM writes a word from register rcm into memory.
LDCM loads a word from memory into register rcm.
The address is calculated as "r[rs1] + r[rs2]" in case that i = 0, otherwise as
"r[rs1] + simm13".

Traps:

privileged_instruction
cm_readaccess_error
cm_writeaccess_error
cm_regnotexist_error
mem_address_not_aligned

Data transfer from status registers to LEON

| Opcode      | op3    | Operation                                                                                 |
| cptoleonsdi | 101100 | Copy a word from the status register of an XPP data input register into a LEON register   |
| cptoleonsdo | 101101 | Copy a word from the status register of an XPP data output register into a LEON register  |
| cptoleonsei | 110000 | Copy a word from the status register of an XPP event input register into a LEON register  |
| cptoleonseo | 110001 | Copy a word from the status register of an XPP event output register into a LEON register |

Format (3):

| 11 | rd | op3 | rs1 | rst(opf) | rs2 |
  31   29   24    18    13         4  0

Assembler Syntax:

cptoleonsdi  reg_rst, reg_rd
cptoleonsdo  reg_rst, reg_rd
cptoleonsei  reg_rst, reg_rd
cptoleonseo  reg_rst, reg_rd

Description:

CPTOLEONSDI loads a word from the status register r[rst] of a data input register
into the register r[rd] of the LEON processor.

CPTOLEONSDO loads a word from the status register r[rst] of a data output register
into the register r[rd] of the LEON processor.

CPTOLEONSEI loads a word from the status register r[rst] of an event input register
into the register r[rd] of the LEON processor.

CPTOLEONSEO loads a word from the status register r[rst] of an event output
register into the register r[rd] of the LEON processor.

Traps:

st_readaccess_error
st_regnotexist_error



Data transfer between XPP configuration register and LEON

| Opcode    | op3    | Operation                                                |
| wrclkr    | 111000 | Write clock ratio LEON-XPP into clock register           |
| wroffsetr | 111001 | Write into memory offset register for memory mapped mode |
| rdclkr    | 111010 | Read clock register for clock ratio LEON-XPP             |
| rdoffsetr | 111011 | Read memory offset register for memory mapped mode       |
| rdtrapr   | 111110 | Read register with information about XPP trap            |

Format (3):

| 11 | rd | op3 | unused | unused | unused |
  31   29   24    18       13 12    4      0

Assembler Syntax:

wrclkr     r_rd, %clkr
wroffsetr  r_rd, %memoffsetr
rdclkr     %clkr, r_rd
rdoffsetr  %memoffsetr, r_rd
rdtrapr    %trapr, r_rd

Description:

WRCLKR loads a word from the register r[rd] into the clock register. In case the
register contains the value 0, the XPP unit is switched off.
WROFFSETR loads a word from the register r[rd] into the memory offset register.
RDCLKR loads a word from the clock register into the register r[rd].
RDOFFSETR loads a word from the memory offset register into the register r[rd].
RDTRAPR loads the content of the trap information register into the register r[rd].
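
The op3 codes and privilege flags listed in the preceding tables can be collected into one lookup, which also shows how a decoder might raise the privileged_instruction trap. This is a hedged sketch: the names XPP_OP3, decode_xpp and PrivilegedInstruction are hypothetical, and the supervisor flag simply stands in for the processor's supervisor state.

```python
class PrivilegedInstruction(Exception):
    """Stands in for the privileged_instruction trap."""


# op3 code -> (mnemonic, privileged), collected from the tables above
XPP_OP3 = {
    0b100010: ("stxppd", False), 0b100011: ("ldxppd", False),
    0b100110: ("stxppe", False), 0b100111: ("ldxppe", False),
    0b101010: ("ldcm", True), 0b101011: ("stcm", True),
    0b101100: ("cptoleonsdi", False), 0b101101: ("cptoleonsdo", False),
    0b101110: ("cptoxppd", False), 0b101111: ("cptoleond", False),
    0b110000: ("cptoleonsei", False), 0b110001: ("cptoleonseo", False),
    0b110010: ("cptoxppe", False), 0b110011: ("cptoleone", False),
    0b110110: ("cptocm", True), 0b110111: ("cptoleoncm", True),
    0b111000: ("wrclkr", True), 0b111001: ("wroffsetr", True),
    0b111010: ("rdclkr", True), 0b111011: ("rdoffsetr", True),
    0b111110: ("rdtrapr", True),
}


def decode_xpp(word, supervisor=False):
    """Return the XPP mnemonic of an instruction word, or None if it is not one."""
    if (word >> 30) != 0b11:
        return None  # all additional instructions use op = 3
    entry = XPP_OP3.get((word >> 19) & 0x3F)
    if entry is None:
        return None  # a regular SPARC V8 format 3 instruction
    mnemonic, privileged = entry
    if privileged and not supervisor:
        raise PrivilegedInstruction(mnemonic)
    return mnemonic


wrclkr_word = (0b11 << 30) | (0b111000 << 19)
user_result = decode_xpp(0xC3700040)                   # non-privileged copy instruction
super_result = decode_xpp(wrclkr_word, supervisor=True)
try:
    decode_xpp(wrclkr_word)
    refused = False
except PrivilegedInstruction:
    refused = True
```

A word with an op3 code outside this table, such as a regular SPARC V8 load, falls through and returns None, reflecting that the extension only occupies op3 codes that were previously unused.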

The following figure shows another example of a preferred
coupling between a conventional (von-Neumann-like and/or
sequential) processor and an array of processing elements
configurable at runtime and/or on the fly, the figure referring to an
XPP by way of example only, although, as in all parts of the
present invention, aspects of the disclosure might in some
cases be better understood by referring to publications that
show and explain the functioning of an XPP in more detail.

Here, a plurality of details is described in other parts of
the present application, as will be obvious from the
similarity of the figures, yet some particular aspects showing
preferred implementations and/or embodiments and/or aspects
can be found in more detail in the following figure.

Now, as for the figure, attention is drawn to the
following facts:



A coupling may use either one of two different paths; both
paths can be implemented as alternatives, although in the
preferred embodiment, these paths are implemented
simultaneously.

The first path transfers data between the ALU (or other part,
particularly in the data path) of the conventional processor
and the XPP, is dps-like and is thus intended for low-volume
data transfer. As shown, it is possible to transfer data from
the XPP array, preferably via FIFOs and preferably a MUX
allowing selection of either XPP event data or XPP result
data in response to a setting of the MUX, preferably by either
the processor or the XPP, to one or a number of operand inputs
of the ALU or other units in the data path for ALU operand
input such as MUXes or the like. It is to be noted that a number
of different data can be transferred in that way, such as
status information, flags and the like as well as arithmetic
data. This transfer can be either from the ALU or a unit
downstream therefrom in the datapath of the conventional
processor. Also, data other than operand data, such as event and/or
[...] internal states, can be transferred from
the XPP to the conventional processor it is coupled to.

The second data path is to and/or from the cache, and it is to
be noted that a coupling may be effected to the D-cache and/or
the I-cache. The coupling to the I-cache is advantageous in
that it allows for a very fast reconfiguration of the
processing array: only a minute amount of data needs to be
handled within the sequential processor, while large
configuration data are still allowed for. Here, the entire
configuration need not be transferred through the ALU or
another conventional unit. Reconfiguration can rely on the
conventional processor sending configurations or, more
preferably, configuration load instructions (e.g. the address
of a configuration or macro needed) to the array and/or to a
configuration unit such as a configuration manager coupled
thereto, e.g. a FILMO, and/or can rely on the array itself
requesting reconfiguration, for example after the
instantiation of a first configuration or of a macro that has
been called as a subroutine or the like by the conventional
processor. With respect to the data coupling to the D-cache or
other (large) memory units such as memory banks, it is
possible to allow for data streaming, e.g. using load/store
configurations within the array as have been described
elsewhere. It is possible to implement various methods of data
streaming using units such as DMA, cache controllers dedicated
to operate together with the array and the like. It is to be
noted that within the data path for this coupling, no register
need be present, so that block move commands are easily
implementable.
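
The benefit of sending a configuration load instruction rather
than the configuration itself can be illustrated with a small
model. It is a sketch under assumptions: the names
xpp_config_t, config_manager_t, cpu_issue_config_load and
cm_fetch_pending are hypothetical, the counters exist only to
make the traffic visible, and the checksum merely stands in
for loading the words into the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical configuration image as it would sit behind the
 * I-cache. */
typedef struct {
    const uint32_t *words;
    size_t n_words;
} xpp_config_t;

/* Hypothetical configuration manager state. */
typedef struct {
    const xpp_config_t *pending;  /* reference sent by the processor */
    size_t words_via_cpu;         /* words that crossed the processor */
    size_t words_via_cache;       /* words fetched over the cache path */
} config_manager_t;

/* Processor side: the load instruction carries only a reference
 * to the configuration, not its contents. */
static void cpu_issue_config_load(config_manager_t *cm,
                                  const xpp_config_t *cfg)
{
    assert(cm != NULL && cfg != NULL);
    cm->pending = cfg;
    cm->words_via_cpu += 1;  /* one address-sized operand */
}

/* Array side: fetch the configuration words directly over the
 * cache path. Returns an XOR checksum of the fetched words, or 0
 * if nothing is pending. */
static uint32_t cm_fetch_pending(config_manager_t *cm)
{
    uint32_t checksum = 0;
    assert(cm != NULL);
    if (cm->pending == NULL)
        return 0;
    for (size_t i = 0; i < cm->pending->n_words; i++) {
        checksum ^= cm->pending->words[i];
        cm->words_via_cache++;
    }
    cm->pending = NULL;
    return checksum;
}
```

However large the configuration, words_via_cpu grows by one
per load in this model, while the bulk of the data is counted
in words_via_cache.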

One of the advantages of the preferred coupling according to
the invention as described in one aspect thereof is that it is
effected via the instruction pipeline of the conventional
processor design. The conventional processor and the array can
be decoupled: the coupling does not rely on registers, need
not handle every single operand separately, and also allows
for a decoupling of processor and array by the use of FIFOs.
The latter aspect is advantageous in that both devices may be
operated asynchronously; that is, it is not absolutely
necessary in each and every case for one unit to wait until
the other has finished a certain task. Instead, it is
sufficient to synchronize the two units by methods such as
interrupt routines and/or polling.



Also, the coupling shown is preferable over those known in
the art since it allows for coupling into both the data and
the control flow.

With respect to other parts of the present application, it is
noted that this part refers to FIFOs used in the data path to
effect the data coupling, whereas other parts, especially
those dealing in more detail with certain compiler techniques,
refer to the use of I-RAMs (internal RAMs) to effect the
decoupling. It will be obvious that a FIFO used in the XPP
data input path, XPP data event input path and/or XPP config
path might be replaced by an I-RAM, or that both I-RAMs and
FIFOs might be used simultaneously.

Where reference is being made to event data, it is to be noted
that in simple cases these will be single-bit data, but that
it is possible to use event vectors as well, that is, event
data having more than one bit.
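
The distinction between single-bit events and event vectors
can be made concrete with a short sketch. The 8-bit width and
the names xpp_event_t, event_single, event_set and event_test
are assumptions chosen for illustration; the text does not
specify a vector width or an encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVENT_WIDTH 8u  /* assumed vector width; not given in the text */

/* Hypothetical event vector: one bit per event line. */
typedef uint8_t xpp_event_t;

/* Simple case: a single-bit event uses only bit 0 of the vector. */
static xpp_event_t event_single(bool fired)
{
    return fired ? 1u : 0u;
}

/* Return a copy of the vector with the given event bit raised. */
static xpp_event_t event_set(xpp_event_t ev, unsigned bit)
{
    assert(bit < EVENT_WIDTH);
    return (xpp_event_t)(ev | (1u << bit));
}

/* Report whether the given event bit is raised. */
static bool event_test(xpp_event_t ev, unsigned bit)
{
    assert(bit < EVENT_WIDTH);
    return ((ev >> bit) & 1u) != 0;
}
```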